bipa-g2p-byt5-small

ByT5-small fine-tuned for Grapheme-to-Phoneme (G2P) conversion in Brazilian Portuguese, producing transcriptions in the International Phonetic Alphabet (IPA). The model is conditioned on dialect tags and was trained on the BIPA dataset (derived from Wiktionary, CC BY-SA 4.0 license), covering six dialectal varieties of Brazilian Portuguese.

Paper: BIPA: Brazilian Portuguese Phonetic Dataset with Dialectal Variations in IPA Standard

Quick Summary

  • Input: short Brazilian Portuguese text, preferably prefixed with a dialect tag (behavior without a tag is not guaranteed).
  • Output: space-separated IPA symbols.
  • Architecture: google/byt5-small fine-tuned for seq2seq.
  • Main metric: PER (Phoneme Error Rate), including a minimum PER variant that accounts for multiple valid transcriptions per word.
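To make the metric concrete, here is a minimal sketch of how PER and a minimum PER could be computed for a single word. It assumes PER is the Levenshtein distance between space-separated phoneme sequences divided by the reference length, and that the minimum variant takes the best score over all valid references; the exact definition used in the BIPA paper may differ. The function names and the example transcriptions are illustrative, not taken from the dataset.

```python
from typing import Iterable, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def per(reference: str, hypothesis: str) -> float:
    """Phoneme error rate for one word (phonemes separated by spaces)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcription is empty")
    return edit_distance(ref, hyp) / len(ref)


def min_per(references: Iterable[str], hypothesis: str) -> float:
    """Assumed 'minimum PER': best score over all valid references."""
    return min(per(ref, hypothesis) for ref in references)


# Illustrative references for one word (not copied from BIPA)
refs = ["k a ʁ u", "k a h u"]
print(per(refs[0], "k a h u"))   # 0.25
print(min_per(refs, "k a h u"))  # 0.0
```

Scoring against the closest reference avoids penalizing the model for choosing one valid pronunciation over another.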

Task

  • Grapheme → IPA conversion (short text-to-text mapping).
  • Support for multiple dialects via explicit tags at the beginning of the input sequence.

How to Use

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

repo_id = "thiagomonteles/BIPA_g2p_Multidialect_Byt5"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)

g2p = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

# simple example (general Brazilian dialect)
# the pipeline returns a list with one dict per input, so take the first item
out = g2p("[brasil] Gratidão", max_new_tokens=32)[0]["generated_text"]
print(out)

You can also use the generation API directly:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

repo_id = "thiagomonteles/BIPA_g2p_Multidialect_Byt5"
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)

text = "[rio-de-janeiro] Gratidão"
inputs = tok(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=35, num_beams=6)
# generate() returns a batch of sequences; decode the first (and only) one
print(tok.decode(outputs[0], skip_special_tokens=True))
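Because ByT5 works on raw UTF-8 bytes, generation limits such as max_new_tokens and max_length are counted in bytes rather than characters, and many IPA symbols take two or more bytes. The sketch below estimates whether a transcription fits a given budget. It assumes the standard ByT5 tokenization of one token per byte plus one end-of-sequence token; the helper names are hypothetical and not part of this repository.

```python
def byte_len(text: str) -> int:
    """Number of UTF-8 bytes; ByT5 is assumed to use one token per byte."""
    return len(text.encode("utf-8"))


def fits_budget(transcription: str, max_new_tokens: int = 32) -> bool:
    """True if the transcription plus one EOS token fits in the budget."""
    return byte_len(transcription) + 1 <= max_new_tokens


prompt = "[rio-de-janeiro] Gratidão"
# The byte count exceeds the character count because "ã" needs two bytes
print(len(prompt), byte_len(prompt))
```

If long words come back cut off, raising max_new_tokens (or max_length) is the first thing to try.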

Specifying a Dialect

The model expects a dialect tag in the format [dialect_tag] word, where the tag is derived from the dialect name in lowercase with spaces replaced by hyphens.

# São Paulo
out_sp = g2p("[são-paulo] carro", max_new_tokens=32)[0]["generated_text"]
print(out_sp)

# Rio de Janeiro
out_rj = g2p("[rio-de-janeiro] porta", max_new_tokens=32)[0]["generated_text"]
print(out_rj)

# Brazil (general)
out_br = g2p("[brasil] pedra", max_new_tokens=32)[0]["generated_text"]
print(out_br)

Supported Dialects

Tags used in the input (from the tag_dialeto column of the processed dataset):

  • [brasil] — general Brazilian Portuguese (mapped from Brazil)
  • [são-paulo] — São Paulo dialect
  • [rio-de-janeiro] — Rio de Janeiro dialect
  • [regiao-sul] — Southern Brazil dialect
  • [regiao-nordeste] — Northeast Brazil dialect
  • [regiao-centro-oeste] — Central-West Brazil dialect (rural)

If no tag is provided, behavior is not guaranteed; it is always recommended to prefix the input with an explicit tag.
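Since an unknown or missing tag gives unspecified behavior, it can help to validate the tag before calling the model. The following is a minimal sketch of such a check; build_input and SUPPORTED_TAGS are hypothetical names, and the tag spellings are copied from the list above. Normalizing to NFC is an assumption about how the training data was encoded.

```python
import unicodedata

# Tag spellings copied from the "Supported Dialects" list
SUPPORTED_TAGS = (
    "brasil",
    "são-paulo",
    "rio-de-janeiro",
    "regiao-sul",
    "regiao-nordeste",
    "regiao-centro-oeste",
)


def build_input(word: str, tag: str) -> str:
    """Return '[tag] word', rejecting tags the model was not trained on."""
    # Accept the tag with or without brackets and in any letter case
    clean = unicodedata.normalize("NFC", tag).strip().strip("[]").strip().lower()
    if clean not in SUPPORTED_TAGS:
        raise ValueError(f"unsupported dialect tag: {tag!r}")
    word = word.strip()
    if not word:
        raise ValueError("word is empty")
    return f"[{clean}] {word}"


print(build_input("carro", "são-paulo"))  # [são-paulo] carro
```

The returned string can be passed directly to the pipeline or tokenizer calls shown in "How to Use".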

Performance by Dialect

Model performance (Minimum PER) on the validation set, broken down by dialect. The two dialects with very few examples (Northeast and Center-West) have far higher error rates than the four well-represented ones. Among the well-represented dialects, more examples do not always mean lower error: São Paulo scores best with the fewest examples of the four.

Dialect            | No. Examples | ↓ Min. PER (%)
São Paulo          | 1,141        | 1.93
Rio de Janeiro     | 2,373        | 3.08
Brazil (general)   | 5,316        | 6.07
South Region       | 2,131        | 2.36
Northeast Region   | 32           | 78.12
Center-West Region | 7            | 42.86
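To see how much the small dialects affect an overall figure, the per-dialect numbers above can be aggregated in two ways. This is a rough sketch: weighting by example count is only an approximation, because a true corpus-level PER would weight by phoneme count, which the table does not report. The function names are illustrative.

```python
# (dialect, number of examples, Minimum PER in %) copied from the table above
ROWS = [
    ("São Paulo", 1141, 1.93),
    ("Rio de Janeiro", 2373, 3.08),
    ("Brazil (general)", 5316, 6.07),
    ("South Region", 2131, 2.36),
    ("Northeast Region", 32, 78.12),
    ("Center-West Region", 7, 42.86),
]


def macro_average(rows):
    """Unweighted mean: every dialect counts equally."""
    return sum(rate for _, _, rate in rows) / len(rows)


def weighted_average(rows):
    """Mean weighted by example count (approximation, see note above)."""
    total = sum(n for _, n, _ in rows)
    return sum(n * rate for _, n, rate in rows) / total


print(f"macro average:    {macro_average(ROWS):.2f}%")
print(f"weighted average: {weighted_average(ROWS):.2f}%")
```

The gap between the two averages shows that a single aggregate score can hide the weak performance on the two smallest dialects.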

Training Data

  • Source: BIPA (Brazilian IPA), built from Wiktionary entries with multiple phonetic transcriptions per word and dialect.

Limitations and Usage Warnings

  • Trained exclusively on Brazilian Portuguese; not suitable for other languages.
  • Transcriptions follow BIPA/Wiktionary-specific conventions and may differ from other phonetic standards.
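  • Minority dialects: [regiao-nordeste] and [regiao-centro-oeste] are heavily underrepresented in the BIPA dataset compared to [brasil], [são-paulo], [rio-de-janeiro], and [regiao-sul]. Expect much lower transcription accuracy for these varieties (Minimum PER of 78.12% and 42.86% on the validation set, versus under 7% for the other four).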

How to Cite

If you use this model or the BIPA dataset, please cite the following paper:

Thiago Monteles de Sousa, Lucas Rafael Gris, and Nádia Félix Felipe da Silva. 2026. BIPA: Brazilian Portuguese Phonetic Dataset with Dialectal Variations in IPA Standard. In Proceedings of the 17th International Conference on Computational Processing of Portuguese (PROPOR 2026) - Vol. 1, pages 478–487, Salvador, Brazil. Association for Computational Linguistics.

  • Model: thiagomonteles/BIPA_g2p_Multidialect_Byt5
  • Dataset: thiagomonteles/BIPA