Italian — Wikilangs Models

Open-source tokenizers, n-gram & Markov language models, vocabulary stats, and word embeddings trained on Italian Wikipedia by Wikilangs.

🌐 Language Page · 🎮 Playground · 📊 Full Research Report

Language Samples

Example sentences drawn from the Italian Wikipedia corpus:

Eventi, invenzioni e scoperte Personaggi nasce Dante Alighieri Altri progetti 07

Eventi, invenzioni e scoperte Periodo della Grande carestia del Personaggi Giovanni Boccaccio nasce nel luglio Altri progetti 02

Eventi, invenzioni e scoperte Fine della cattività avignonese A Vicenza venne sparato il primo fuoco d'artificio Europeo. Personaggi Altri progetti 08

Eventi, invenzioni e scoperte Personaggi ... Altri progetti 09

Eventi, invenzioni e scoperte Viene inventato il Lapis Benjamin Franklin inventa il Parafulmine. Personaggi Wolfgang Amadeus Mozart Altri progetti 06

Quick Start

Load the Tokenizer

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("it_tokenizer_32k.model")

text = "Eventi, invenzioni e scoperte Viene inventato il Lapis Benjamin Franklin inventa"
tokens = sp.EncodeAsPieces(text)
ids    = sp.EncodeAsIds(text)

print(tokens)  # subword pieces
print(ids)     # integer ids

# Decode back
print(sp.DecodeIds(ids))

Tokenization examples

Sample 1: Eventi, invenzioni e scoperte Viene inventato il Lapis Benjamin Franklin inventa…

Vocab	Tokens	Count
8k	`▁eventi , ▁invenzioni ▁e ▁scoperte ▁viene ▁inv entato ▁il ▁la … (+29 more)`	39
16k	`▁eventi , ▁invenzioni ▁e ▁scoperte ▁viene ▁inventato ▁il ▁la pis … (+21 more)`	31
32k	`▁eventi , ▁invenzioni ▁e ▁scoperte ▁viene ▁inventato ▁il ▁la pis … (+17 more)`	27
64k	`▁eventi , ▁invenzioni ▁e ▁scoperte ▁viene ▁inventato ▁il ▁la pis … (+17 more)`	27

Sample 2: Eventi, invenzioni e scoperte Roma - Inaugurazione del Colosseo Personaggi 81 Ro…

Vocab	Tokens	Count
8k	`▁eventi , ▁invenzioni ▁e ▁scoperte ▁roma ▁- ▁inaugu razione ▁del … (+19 more)`	29
16k	`▁eventi , ▁invenzioni ▁e ▁scoperte ▁roma ▁- ▁inaugu razione ▁del … (+19 more)`	29
32k	`▁eventi , ▁invenzioni ▁e ▁scoperte ▁roma ▁- ▁inaugurazione ▁del ▁colo … (+18 more)`	28
64k	`▁eventi , ▁invenzioni ▁e ▁scoperte ▁roma ▁- ▁inaugurazione ▁del ▁colosseo … (+16 more)`	26

Sample 3: Eventi, invenzioni e scoperte Fine della cattività avignonese A Vicenza venne sp…

Vocab	Tokens	Count
8k	`▁eventi , ▁invenzioni ▁e ▁scoperte ▁fine ▁della ▁ca tti vità … (+23 more)`	33
16k	`▁eventi , ▁invenzioni ▁e ▁scoperte ▁fine ▁della ▁ca ttività ▁avi … (+22 more)`	32
32k	`▁eventi , ▁invenzioni ▁e ▁scoperte ▁fine ▁della ▁ca ttività ▁avi … (+22 more)`	32
64k	`▁eventi , ▁invenzioni ▁e ▁scoperte ▁fine ▁della ▁cattività ▁avignon ese … (+18 more)`	28
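
The Count column above is the number of subword pieces produced for each sample, and it generally shrinks as the vocabulary grows. A minimal sketch of one way to turn that into a compression figure, assuming compression means average characters per token (this page does not define the metric). The helper name `chars_per_token` and the hand-written piece list are hypothetical and are not actual tokenizer output.

```python
def chars_per_token(text: str, pieces: list[str]) -> float:
    """Average number of input characters covered by one subword piece."""
    if not pieces:
        raise ValueError("pieces must not be empty")
    return len(text) / len(pieces)


# Illustrative input only: a hand-written split, not real tokenizer output.
text = "viene inventato il lapis"
pieces = ["▁viene", "▁inventato", "▁il", "▁la", "pis"]

ratio = chars_per_token(text, pieces)
print(f"{len(pieces)} pieces, {ratio:.2f} chars/token")
```

With a loaded tokenizer, the piece list would come from `sp.EncodeAsPieces(text)` as in the Quick Start snippet.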

Load Word Embeddings

from gensim.models import KeyedVectors

# Aligned embeddings (cross-lingual, mapped to English vector space)
wv = KeyedVectors.load("it_embeddings_128d_aligned.kv")

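query = "word"  # placeholder: replace with a token that exists in the vocabulary
if query in wv:
    for word, score in wv.most_similar(query, topn=5):
        print(f"  {word}: {score:.3f}")
else:
    print(f"{query!r} is not in the embedding vocabulary")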
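
The scores returned by gensim's `most_similar` are cosine similarities. The sketch below reproduces that ranking in pure Python on toy 2-dimensional vectors so the numbers are easy to check by hand. The vectors, the `toy_vectors` dictionary, and both helper functions are made up for illustration and are not taken from the released embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def most_similar(query, vectors, topn=5):
    """Rank every other key by cosine similarity to `query`, highest first."""
    scores = [
        (key, cosine_similarity(vectors[query], vec))
        for key, vec in vectors.items()
        if key != query
    ]
    return sorted(scores, key=lambda kv: kv[1], reverse=True)[:topn]


# Toy vectors (hypothetical values, chosen only for illustration).
toy_vectors = {
    "re": [1.0, 0.0],
    "regina": [1.0, 1.0],
    "mela": [0.0, 1.0],
}

ranked = most_similar("re", toy_vectors, topn=2)
for word, score in ranked:
    print(f"  {word}: {score:.3f}")
```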

Load N-gram Model

import pyarrow.parquet as pq

df = pq.read_table("it_3gram_word.parquet").to_pandas()
print(df.head())
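
N-gram models are scored by perplexity in the Metrics Summary. A minimal sketch of the standard definition, the exponential of the mean negative log-probability a model assigns to each observed token, is given here for orientation. This page does not state how its perplexity values were computed, so treat the formula as an assumption about the general technique. The function name and the probabilities are illustrative.

```python
import math


def perplexity(token_probs):
    """exp(-mean(log p)) over the probabilities assigned to observed tokens."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    mean_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_neg_log)


# A model that always spreads its mass evenly over 4 options
# is exactly as uncertain as a fair 4-way choice.
uniform = [0.25] * 8
print(f"uniform over 4 options: {perplexity(uniform):.2f}")

# Hypothetical per-token probabilities from a more confident model.
confident = [0.9, 0.8, 0.95, 0.7]
print(f"confident model: {perplexity(confident):.2f}")
```

Lower values mean the model is less surprised by the text.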

Models Overview

Category	Assets
Tokenizers	BPE at 8k, 16k, 32k, 64k vocab sizes
N-gram models	2 / 3 / 4 / 5-gram (word & subword)
Markov chains	Context 1–5 (word & subword)
Embeddings	32d, 64d, 128d — mono & aligned
Vocabulary	Full frequency list + Zipf analysis
Statistics	Corpus & model statistics JSON

Metrics Summary

Component	Model	Key Metric	Value
Tokenizer	8k BPE	Compression	3.86x
Tokenizer	16k BPE	Compression	4.25x
Tokenizer	32k BPE	Compression	4.58x
Tokenizer	64k BPE	Compression	4.82x 🏆
N-gram	2-gram (subword)	Perplexity	214 🏆
N-gram	2-gram (word)	Perplexity	204,245
N-gram	3-gram (subword)	Perplexity	1,722
N-gram	3-gram (word)	Perplexity	980,193
N-gram	4-gram (subword)	Perplexity	10,064
N-gram	4-gram (word)	Perplexity	1,937,953
N-gram	5-gram (subword)	Perplexity	43,596
N-gram	5-gram (word)	Perplexity	1,090,157
Markov	ctx-1 (subword)	Predictability	0.0%
Markov	ctx-1 (word)	Predictability	0.0%
Markov	ctx-2 (subword)	Predictability	32.2%
Markov	ctx-2 (word)	Predictability	53.2%
Markov	ctx-3 (subword)	Predictability	27.9%
Markov	ctx-3 (word)	Predictability	79.8%
Markov	ctx-4 (subword)	Predictability	32.0%
Markov	ctx-4 (word)	Predictability	92.6% 🏆
Vocabulary	full	Size	511,837
Vocabulary	full	Zipf R²	0.9968
Embeddings	mono_32d	Isotropy	0.7834
Embeddings	mono_64d	Isotropy	0.7465
Embeddings	mono_128d	Isotropy	0.6690
Embeddings	aligned_32d	Isotropy	0.7834 🏆
Embeddings	aligned_64d	Isotropy	0.7465
Embeddings	aligned_128d	Isotropy	0.6690
Alignment	aligned_32d	R@1 / R@5 / R@10	39.2% / 64.2% / 74.8%
Alignment	aligned_64d	R@1 / R@5 / R@10	60.6% / 81.4% / 85.8%
Alignment	aligned_128d	R@1 / R@5 / R@10	67.8% / 88.8% / 93.4% 🏆

📊 Full ablation study, per-model breakdowns, and interpretation guide →
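
The Alignment rows report R@1, R@5, and R@10. A minimal sketch of recall at k as it is commonly defined for retrieval follows: the fraction of source words whose reference translation appears among the top k retrieved candidates. The evaluation procedure behind the table is not described on this page, so this definition is an assumption. The function name, the word pairs, and the candidate lists are invented for illustration.

```python
def recall_at_k(ranked_candidates, gold, k):
    """Fraction of source words whose gold target is in the top-k candidates."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not ranked_candidates:
        raise ValueError("ranked_candidates must not be empty")
    hits = sum(
        1
        for source, candidates in ranked_candidates.items()
        if gold[source] in candidates[:k]
    )
    return hits / len(ranked_candidates)


# Hypothetical Italian -> English retrieval results, best candidate first.
ranked_candidates = {
    "cane": ["dog", "cat", "wolf"],
    "gatto": ["dog", "cat", "lion"],
    "casa": ["home", "building", "house"],
}
gold = {"cane": "dog", "gatto": "cat", "casa": "house"}

for k in (1, 2, 3):
    print(f"R@{k}: {recall_at_k(ranked_candidates, gold, k):.1%}")
```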

About

Trained on wikipedia-monthly: monthly snapshots of Wikipedia in 300+ languages.

A project by Wikilangs · Maintainer: Omar Kamali · Omneity Labs

Citation

@misc{wikilangs2025,
  author    = {Kamali, Omar},
  title     = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year      = {2025},
  doi       = {10.5281/zenodo.18073153},
  publisher = {Zenodo},
  url       = {https://huggingface.co/wikilangs},
  institution = {Omneity Labs}
}
