NaolBM/african-corpus
Comparison against the Gemma-3 and Qwen-3 tokenizers (token count per sample; lower is better):
| Language | Africa-BBPE | Gemma-3 | Qwen-3 | Winner |
|---|---|---|---|---|
| Amharic | 4 | 15 | 18 | 🇪🇹 Africa-BBPE |
| Swahili | 8 | 10 | 12 | 🇰🇪 Africa-BBPE |
| Hausa | 11 | 12 | 12 | 🇳🇬 Africa-BBPE |
| Oromo | 5 | 11 | 10 | 🇪🇹 Africa-BBPE |
| Yoruba | 9 | 8 | 8 | 🇳🇬 Gemma-3 (tied with Qwen-3) |
| Tigrinya | 5 | 13 | 21 | 🇪🇷 Africa-BBPE |
| English | 7 | 4 | 3 | 🇬🇧 Qwen-3 |
| Code-switching | 10 | 17 | 21 | Africa-BBPE |

Summary over the eight samples:
| Metric | Africa-BBPE | Gemma-3 | Qwen-3 |
|---|---|---|---|
| Wins | 6 | 1 | 1 |
| Total Tokens | 59 | 90 | 105 |
| Avg Tokens/Sample | 7.38 | 11.25 | 13.13 |
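
The totals and averages above can be recomputed from the per-language counts. The sketch below does that and also derives the relative token saving against each baseline. The names `COUNTS`, `summarize`, and `relative_saving` are illustrative and not part of the tokenizer's API; the numbers are copied from the comparison table.

```python
# Token counts per sample, copied from the comparison table.
# Column order: Africa-BBPE, Gemma-3, Qwen-3.
TOKENIZERS = ("Africa-BBPE", "Gemma-3", "Qwen-3")
COUNTS = {
    "Amharic": (4, 15, 18),
    "Swahili": (8, 10, 12),
    "Hausa": (11, 12, 12),
    "Oromo": (5, 11, 10),
    "Yoruba": (9, 8, 8),
    "Tigrinya": (5, 13, 21),
    "English": (7, 4, 3),
    "Code-switching": (10, 17, 21),
}


def summarize(counts):
    """Return (totals, averages) keyed by tokenizer name."""
    totals = [sum(row[i] for row in counts.values()) for i in range(len(TOKENIZERS))]
    averages = [t / len(counts) for t in totals]
    return dict(zip(TOKENIZERS, totals)), dict(zip(TOKENIZERS, averages))


def relative_saving(ours, theirs):
    """Fraction of tokens saved relative to a baseline tokenizer."""
    return (theirs - ours) / theirs


totals, averages = summarize(COUNTS)
for name in TOKENIZERS:
    print(f"{name}: total={totals[name]}, avg={averages[name]:.3f}")
for baseline in ("Gemma-3", "Qwen-3"):
    saving = relative_saving(totals["Africa-BBPE"], totals[baseline])
    print(f"Africa-BBPE vs {baseline}: {saving:.1%} fewer tokens")
```

On these eight samples the saving works out to roughly a third fewer tokens than Gemma-3 and a bit over two fifths fewer than Qwen-3. With one sample per language, the figures are indicative rather than a benchmark.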

Average tokens per sample by language family:
| Language Family | Africa-BBPE | Gemma-3 | Qwen-3 |
|---|---|---|---|
| Semitic (Ge'ez) | 4.5 | 14.0 | 19.5 |
| Cushitic | 5.0 | 11.0 | 10.0 |
| Bantu | 8.0 | 10.0 | 12.0 |
| Chadic | 11.0 | 12.0 | 12.0 |
| Benue-Congo | 9.0 | 8.0 | 8.0 |
| Germanic | 7.0 | 4.0 | 3.0 |
| Code-switching | 10.0 | 17.0 | 21.0 |
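
The family averages can be reproduced by grouping the per-language counts. This is a minimal sketch; the `FAMILY` mapping is inferred from the two tables (for example, the Semitic row equals the mean of Amharic and Tigrinya) rather than stated explicitly in the card, and the function name is illustrative.

```python
from collections import defaultdict

# Per-language counts (Africa-BBPE, Gemma-3, Qwen-3), copied from the comparison table.
COUNTS = {
    "Amharic": (4, 15, 18),
    "Swahili": (8, 10, 12),
    "Hausa": (11, 12, 12),
    "Oromo": (5, 11, 10),
    "Yoruba": (9, 8, 8),
    "Tigrinya": (5, 13, 21),
    "English": (7, 4, 3),
    "Code-switching": (10, 17, 21),
}

# Assumed language-to-family mapping, inferred from the family table.
FAMILY = {
    "Amharic": "Semitic (Ge'ez)",
    "Tigrinya": "Semitic (Ge'ez)",
    "Oromo": "Cushitic",
    "Swahili": "Bantu",
    "Hausa": "Chadic",
    "Yoruba": "Benue-Congo",
    "English": "Germanic",
    "Code-switching": "Code-switching",
}


def family_averages(counts, family):
    """Average each tokenizer's token count over the languages in a family."""
    grouped = defaultdict(list)
    for language, row in counts.items():
        grouped[family[language]].append(row)
    return {
        fam: tuple(sum(col) / len(rows) for col in zip(*rows))
        for fam, rows in grouped.items()
    }


for fam, avg in family_averages(COUNTS, FAMILY).items():
    print(fam, avg)
```

Only the Semitic row aggregates more than one language; every other family has a single sample, so its average is just that sample's count.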

Supported languages:
| Language | Code | Script | Tokenization Efficiency |
|---|---|---|---|
| Amharic | am | Ge'ez | ⭐⭐⭐⭐⭐ (4 tokens avg) |
| Tigrinya | ti | Ge'ez | ⭐⭐⭐⭐⭐ (5 tokens avg) |
| Oromo | om | Latin | ⭐⭐⭐⭐⭐ (5 tokens avg) |
| Swahili | sw | Latin | ⭐⭐⭐⭐ (8 tokens avg) |
| Hausa | ha | Latin | ⭐⭐⭐ (11 tokens avg) |
| Yoruba | yo | Latin | ⭐⭐⭐ (9 tokens avg) |
| English | en | Latin | ⭐⭐ (7 tokens avg) |
| Code-switching | Mixed | Mixed | ⭐⭐⭐⭐⭐ (10 tokens avg) |

from transformers import AutoTokenizer
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("NaolBM/Africa-BBPE")
# Example usage
text = "አማርኛ ቋንቋ በኢትዮጵያ"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
print(f"Tokens: {tokens}")
print(f"Token IDs: {ids}")
print(f"Number of tokens: {len(ids)}")
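
To reproduce a comparison like the one in the tables, the same text can be passed through several tokenizers and the ID counts compared. The sketch below is one way to do that, not the procedure used for the card. `count_tokens` and `compare` are hypothetical helpers, and the two stand-in classes exist only so the example runs offline. It assumes counts should exclude special tokens, which the card does not state.

```python
def count_tokens(tokenizer, text):
    """Number of token IDs for `text`, without special tokens such as BOS/EOS."""
    return len(tokenizer.encode(text, add_special_tokens=False))


def compare(tokenizers, text):
    """Return ({name: token count}, name of the tokenizer with the fewest tokens)."""
    counts = {name: count_tokens(tok, text) for name, tok in tokenizers.items()}
    # On a tie, min() keeps the first name in insertion order.
    return counts, min(counts, key=counts.get)


# Hypothetical stand-ins so the sketch runs offline; they only mimic `encode`.
class WordTokenizer:
    def encode(self, text, add_special_tokens=False):
        return list(range(len(text.split())))


class CharTokenizer:
    def encode(self, text, add_special_tokens=False):
        return list(range(len(text)))


counts, best = compare(
    {"words": WordTokenizer(), "chars": CharTokenizer()},
    "habari ya asubuhi",
)
print(counts, best)
```

For a real comparison, replace the stand-ins with tokenizers loaded via `AutoTokenizer.from_pretrained(...)`, for example the `NaolBM/Africa-BBPE` tokenizer from the snippet above alongside the baseline checkpoints of your choice. Because ties resolve to the first entry, the order of the dictionary decides the reported winner when two tokenizers produce the same count.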

Corpus composition by language:
| Language | Rows | Percentage |
|---|---|---|
| Swahili | 14,125,925 | 39.97% |
| Amharic | 10,815,255 | 30.60% |
| Hausa | 7,144,077 | 20.21% |
| English | 2,119,719 | 6.00% |
| Oromo | 881,450 | 2.49% |
| Yoruba | 245,837 | 0.70% |
| Tigrinya | 12,076 | 0.03% |
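
The percentage column follows directly from the row counts. A short check, with the counts copied from the table and `ROWS` and `shares` as illustrative names:

```python
# Row counts per language, copied from the table.
ROWS = {
    "Swahili": 14_125_925,
    "Amharic": 10_815_255,
    "Hausa": 7_144_077,
    "English": 2_119_719,
    "Oromo": 881_450,
    "Yoruba": 245_837,
    "Tigrinya": 12_076,
}

total = sum(ROWS.values())
# Share of each language as a percentage of all rows.
shares = {language: 100 * n / total for language, n in ROWS.items()}

print(f"Total rows: {total:,}")
for language, share in shares.items():
    print(f"{language}: {share:.2f}%")
```

The split is heavily skewed: Swahili, Amharic, and Hausa together account for about 90% of the rows, while Tigrinya contributes well under 0.1%.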
Compared to Gemma-3:
Compared to Qwen-3:
License: MIT