bipa-g2p-byt5-small
ByT5-small fine-tuned for Grapheme-to-Phoneme (G2P) conversion in Brazilian Portuguese, producing transcriptions in the International Phonetic Alphabet (IPA). The model is conditioned on dialect tags and was trained on the BIPA dataset (derived from Wiktionary, CC BY-SA 4.0 license), covering six dialectal varieties of Brazilian Portuguese.
Paper: BIPA: Brazilian Portuguese Phonetic Dataset with Dialectal Variations in IPA Standard
Quick Summary
- Input: short Brazilian Portuguese text, with or without a dialect tag.
- Output: space-separated IPA symbols.
- Architecture: google/byt5-small fine-tuned for seq2seq.
- Main metric: PER (Phoneme Error Rate), including a minimum PER variant that accounts for multiple valid transcriptions per word.
Task
- Grapheme → IPA conversion (short text-to-text mapping).
- Support for multiple dialects via explicit tags at the beginning of the input sequence.
How to Use
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
repo_id = "thiagomonteles/BIPA_g2p_Multidialect_Byt5"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)
g2p = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
# simple example (general Brazilian dialect)
# the pipeline returns a list with one dict per input, so take element [0] first
out = g2p("[brasil] Gratidão", max_new_tokens=32)[0]["generated_text"]
print(out)
You can also use the generation API directly:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
repo_id = "thiagomonteles/BIPA_g2p_Multidialect_Byt5"
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)
text = "[rio-de-janeiro] Gratidão"
inputs = tok(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=35, num_beams=6)
# generate() returns a batch of sequences; decode the first (and only) one
print(tok.decode(outputs[0], skip_special_tokens=True))
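ByT5 operates on UTF-8 bytes rather than subword tokens, so limits such as max_new_tokens and max_length count bytes, and many IPA symbols take two or more bytes each. The sketch below estimates how many tokens a string would occupy, assuming one token per UTF-8 byte plus one end-of-sequence token. The helper name byt5_token_count is hypothetical and not part of transformers, and the sample string is illustrative, not actual model output.

```python
def byt5_token_count(text: str) -> int:
    """Estimate the ByT5 token count: one token per UTF-8 byte, plus one EOS token.

    Assumption: no extra special tokens besides the end-of-sequence marker.
    """
    return len(text.encode("utf-8")) + 1


# Illustrative space-separated IPA-like string (not real model output).
sample = "p \u0254 \u027e t a"

# The visible character count is smaller than the estimated token count,
# because some of the symbols need two bytes in UTF-8.
print(len(sample), byt5_token_count(sample))
```

If long inputs come back truncated, comparing this estimate against the generation limit is a quick first check.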
Specifying a Dialect
The model expects a dialect tag in the format [dialect_tag] word, where the tag is derived from the dialect name in lowercase with spaces replaced by hyphens.
# The pipeline returns a list with one dict per input, hence the [0] index.
# São Paulo
out_sp = g2p("[são-paulo] carro", max_new_tokens=32)[0]["generated_text"]
print(out_sp)
# Rio de Janeiro
out_rj = g2p("[rio-de-janeiro] porta", max_new_tokens=32)[0]["generated_text"]
print(out_rj)
# Brazil (general)
out_br = g2p("[brasil] pedra", max_new_tokens=32)[0]["generated_text"]
print(out_br)
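The tag rule described above (lowercase, spaces replaced by hyphens, wrapped in brackets) can be sketched as a small helper. The function names make_dialect_tag and build_input are hypothetical and do not come from the model repository. The helper applies only the stated rule, so the name passed in must already match the dataset's naming.

```python
def make_dialect_tag(dialect_name: str) -> str:
    """Lowercase the dialect name, join its words with hyphens, wrap in brackets."""
    slug = "-".join(dialect_name.lower().split())
    return f"[{slug}]"


def build_input(dialect_name: str, word: str) -> str:
    """Prefix a word with its dialect tag in the '[dialect_tag] word' format."""
    return f"{make_dialect_tag(dialect_name)} {word}"


print(build_input("Rio de Janeiro", "porta"))  # [rio-de-janeiro] porta
```

The resulting string can be passed to the pipeline or tokenizer in place of a hand-written tag.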
Supported Dialects
Tags used in the input (from the tag_dialeto column of the processed dataset):
- [brasil]: general Brazilian Portuguese (mapped from Brazil)
- [são-paulo]: São Paulo dialect
- [rio-de-janeiro]: Rio de Janeiro dialect
- [regiao-sul]: Southern Brazil dialect
- [regiao-nordeste]: Northeast Brazil dialect
- [regiao-centro-oeste]: Central-West Brazil dialect (rural)
If no tag is provided, behavior is not guaranteed; it is always recommended to prefix the input with an explicit tag.
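Since untagged input has no guaranteed behavior, it may help to validate inputs before calling the model. The following is a minimal sketch, not part of the model's own code: ensure_tagged is a hypothetical helper, and falling back to [brasil] for untagged input is an assumption made here for illustration.

```python
# Tags listed in this model card.
SUPPORTED_TAGS = (
    "[brasil]",
    "[são-paulo]",
    "[rio-de-janeiro]",
    "[regiao-sul]",
    "[regiao-nordeste]",
    "[regiao-centro-oeste]",
)


def ensure_tagged(text: str, default_tag: str = "[brasil]") -> str:
    """Return text with a supported dialect tag at the start.

    Untagged text gets default_tag prepended (an illustrative choice);
    text with an unknown or malformed tag raises ValueError.
    """
    stripped = text.strip()
    if stripped.startswith("["):
        # Take everything up to the first closing bracket as the tag.
        tag = stripped.split("]", 1)[0] + "]"
        if tag not in SUPPORTED_TAGS:
            raise ValueError(f"Unsupported dialect tag: {tag}")
        return stripped
    return f"{default_tag} {stripped}"


print(ensure_tagged("pedra"))               # [brasil] pedra
print(ensure_tagged("[regiao-sul] porta"))  # [regiao-sul] porta
```

Raising on unknown tags, rather than silently passing them through, keeps typos such as a misspelled tag from producing unreliable transcriptions.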
Performance by Dialect
Model performance (minimum PER, lower is better) on the validation set, broken down by dialect. Results indicate that transcription accuracy depends on training data volume: the minority dialects with very few examples (Northeast and Center-West) show far higher error rates than the others.
| Dialect | No. Examples | ↓ Min. PER (%) |
|---|---|---|
| São Paulo | 1,141 | 1.93 |
| Rio de Janeiro | 2,373 | 3.08 |
| Brazil (general) | 5,316 | 6.07 |
| South Region | 2,131 | 2.36 |
| Northeast Region | 32 | 78.12 |
| Center-West Region | 7 | 42.86 |
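For readers unfamiliar with the metric, here is a minimal sketch of how a minimum PER could be computed. It assumes PER is the Levenshtein distance between space-separated phoneme sequences divided by the reference length, and that the minimum is taken over all valid reference transcriptions of a word. The exact definition used in the paper may differ, and the phoneme strings below are illustrative, not taken from the dataset.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two symbol sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def per(reference: str, hypothesis: str) -> float:
    """Phoneme error rate against one reference (space-separated symbols)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one phoneme")
    return edit_distance(ref, hyp) / len(ref)


def min_per(references: Sequence[str], hypothesis: str) -> float:
    """Lowest PER over all valid reference transcriptions."""
    return min(per(r, hypothesis) for r in references)


# Illustrative references for one word with two accepted pronunciations.
refs = ["p ɔ ɾ t a", "p ɔ ʁ t a"]
print(per(refs[0], "p ɔ ʁ t a"))   # one substitution out of five phonemes
print(min_per(refs, "p ɔ ʁ t a"))  # exact match with the second reference
```

Taking the minimum avoids penalizing a prediction that matches any one of several accepted pronunciations.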
Training Data
- Source: BIPA (Brazilian IPA), built from Wiktionary entries with multiple phonetic transcriptions per word and dialect.
Limitations and Usage Warnings
- Trained exclusively on Brazilian Portuguese; not suitable for other languages.
- Transcriptions follow BIPA/Wiktionary-specific conventions and may differ from other phonetic standards.
- Minority dialects: [regiao-nordeste] and [regiao-centro-oeste] are underrepresented in the BIPA dataset compared to [brasil], [são-paulo], and [rio-de-janeiro]. Expect lower transcription accuracy for these varieties.
How to Cite
If you use this model or the BIPA dataset, please cite the following paper:
Thiago Monteles de Sousa, Lucas Rafael Gris, and Nádia Félix Felipe da Silva. 2026. BIPA: Brazilian Portuguese Phonetic Dataset with Dialectal Variations in IPA Standard. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1, pages 478–487, Salvador, Brazil. Association for Computational Linguistics.
- Model: thiagomonteles/BIPA_g2p_Multidialect_Byt5
- Dataset: thiagomonteles/BIPA