BERTomelo
BERTomelo is a family of BERT-style encoders built on the ModernBERT architecture. It is pretrained from scratch on the Classified Common Crawl Corpus for Portuguese (ClassiCC-PT) and tailored to the Portuguese language. BERTomelo aims to fill the gap left by older Brazilian BERT models by offering native long-context capability, which makes it well suited to tasks that involve long documents while retaining the efficiency of the ModernBERT reference architecture.
Architectural Improvements
BERTomelo leverages several recent advancements in transformer design integrated into the ModernBERT architecture, such as:
- Rotary Positional Embeddings (RoPE): Support long contexts by encoding positional information through rotation, which allows better extrapolation to larger sequence lengths.
- Local-Global Alternating Attention: Improves computational efficiency by alternating between local attention (focusing on neighboring tokens) and global attention (capturing dependencies across the entire sequence).
- Unpadding and FlashAttention 2: Improve hardware-level efficiency by removing unnecessary padding tokens and using an IO-aware attention implementation to maximize token throughput.
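To make the alternating attention pattern concrete, the following is a minimal sketch of how a per-layer attention mask could be built. The function name `build_attention_mask` and the values `window=4` and `global_every=3` are illustrative assumptions chosen for readability. They are not read from BERTomelo's configuration, and real implementations apply the pattern inside fused attention kernels rather than materializing a mask like this.

```python
def build_attention_mask(seq_len, layer_idx, window=4, global_every=3):
    """Return a seq_len x seq_len boolean mask (hypothetical helper).

    mask[i][j] is True when query position i may attend to key position j.
    Layers whose index is a multiple of `global_every` use global attention;
    all other layers use a symmetric sliding window of total width `window`.
    """
    if layer_idx % global_every == 0:
        # Global layer: every token can attend to every other token.
        return [[True] * seq_len for _ in range(seq_len)]
    half = window // 2
    # Local layer: only tokens within `half` positions on either side.
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]


def render(mask):
    """Draw the mask as text, one row per query position."""
    return "\n".join("".join("x" if cell else "." for cell in row) for row in mask)


if __name__ == "__main__":
    for layer in range(3):
        kind = "global" if layer % 3 == 0 else "local"
        print(f"layer {layer} ({kind})")
        print(render(build_attention_mask(8, layer)))
```

In a local layer the number of allowed query-key pairs grows linearly with sequence length instead of quadratically, which is where the efficiency gain comes from.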
BERTomelo was pretrained on 386.48 billion tokens. The custom tokenizer comes from ModBERTBr and was trained with the Unigram algorithm on the BrWaC corpus and Portuguese Wikipedia.
Available Models
| Model Variant | Layers | Parameters | Context Window |
|---|---|---|---|
| BERTomelo Base | 22 | 136 Million | 1,024 tokens |
| BERTomelo Large | 28 | 377 Million | 1,024 tokens |
Usage
BERTomelo is supported by the transformers library from version v4.48.0 onward. Quote the requirement so the shell does not treat `>` as a redirection:
pip install -U "transformers>=4.48.0"
Because BERTomelo is pretrained with a masked language modeling (MLM) objective, it can be used directly for inference via AutoModelForMaskedLM or the fill-mask pipeline. It also serves as a backbone for fine-tuning on downstream tasks such as classification, textual similarity, retrieval, or question answering.
⚠️ Note: We recommend using FlashAttention 2 to achieve the highest processing efficiency and throughput.
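As a rough sketch of how that recommendation might be applied, the helper below selects `attn_implementation="flash_attention_2"` only when the `flash_attn` package is importable, and otherwise falls back to PyTorch's SDPA attention. The helper names `build_load_kwargs` and `load_model` are hypothetical. The sketch also assumes a CUDA GPU with bfloat16 support whenever FlashAttention 2 is selected, since those kernels run in half precision on the GPU.

```python
import importlib.util

MODEL_ID = "unb-labia/BERTomelo-ModernBERT-Large-v1"


def build_load_kwargs(prefer_flash=True):
    """Choose from_pretrained keyword arguments (hypothetical helper)."""
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    if prefer_flash and flash_installed:
        # FlashAttention 2 needs half precision; bfloat16 is assumed here.
        return {"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"}
    # Fallback: scaled dot product attention shipped with PyTorch.
    return {"attn_implementation": "sdpa"}


def load_model(model_id=MODEL_ID, prefer_flash=True):
    """Load the masked language model with the chosen attention backend."""
    # Imported lazily so the helper above works without these packages.
    import torch
    from transformers import AutoModelForMaskedLM

    kwargs = build_load_kwargs(prefer_flash)
    use_flash = kwargs["attn_implementation"] == "flash_attention_2"
    if "torch_dtype" in kwargs:
        # Convert the dtype name into the actual torch dtype object.
        kwargs["torch_dtype"] = getattr(torch, kwargs["torch_dtype"])
    model = AutoModelForMaskedLM.from_pretrained(model_id, **kwargs)
    if use_flash:
        # Assumption: a CUDA device is present when flash_attn is installed.
        model = model.to("cuda")
    return model
```

Calling `load_model()` downloads the checkpoint. Remember to move the tokenized inputs to the same device as the model before running inference.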
Using AutoModelForMaskedLM:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "unb-labia/BERTomelo-ModernBERT-Large-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "O cachorro mais famoso do brasil é o vira [MASK]."
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and pick the highest-scoring vocabulary entry
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
Using a pipeline:
from transformers import pipeline

pipe = pipeline(
    "fill-mask",
    model="unb-labia/BERTomelo-ModernBERT-Large-v1",
)

input_text = "O Brasil está presente na [MASK] do sul."
results = pipe(input_text)

# Each result holds a candidate token and its probability
for result in results:
    print(f"Token: {result['token_str']} | Score: {result['score']:.4f}")
Note: Unlike traditional BERT models, the ModernBERT architecture utilized by BERTomelo does not require token type IDs. You can omit this parameter in most downstream tasks.
Evaluation
The following table summarizes the performance of BERTomelo compared to other prominent models for Brazilian Portuguese across Semantic Textual Similarity (STS), Recognizing Textual Entailment (RTE), and Named Entity Recognition (NER).
| Model | STS MSE ↓ | STS Pearson ↑ | RTE F1 ↑ | RTE Acc ↑ | NER F1 ↑ | NER Recall ↑ | NER Prec ↑ |
|---|---|---|---|---|---|---|---|
| mBERT | 0.597 | 0.801 | 84.45% | 84.52% | 88.50% | 90.45% | 86.63% |
| BERTimbau | 0.580 | 0.836 | 89.20% | 89.20% | 90.48% | 91.64% | 89.35% |
| BERTuguês | 0.583 | 0.823 | 86.27% | 86.40% | 89.56% | 90.59% | 88.55% |
| ModernBERT | 0.514 | 0.790 | 81.09% | 81.17% | 75.76% | 78.03% | 73.61% |
| ModBERTBr | 0.509 | 0.812 | 85.28% | 85.42% | 90.08% | 92.40% | 87.88% |
| BERTomelo Base | 0.425 | 0.833 | 87.65% | 87.70% | 91.37% | 92.29% | 90.46% |
| BERTimbau Large | 0.500 | 0.852 | 90.04% | 90.04% | 90.54% | 92.24% | 88.90% |
| Albertina 900M | 0.570 | 0.853 | 89.09% | 89.09% | 87.28% | 91.43% | 83.49% |
| BERTomelo Large | 0.401 | 0.849 | 89.21% | 89.26% | 91.05% | 92.75% | 89.42% |
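For readers unfamiliar with the column headers, the sketch below shows the standard textbook definitions of the reported metrics in plain Python. It assumes the table uses these usual definitions. The function names are illustrative, and the actual evaluation scripts are not part of this card.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between gold and predicted scores (lower is better)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(y_true, y_pred):
    """Pearson correlation between gold and predicted scores (higher is better)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("Pearson correlation is undefined for constant inputs")
    return cov / math.sqrt(var_t * var_p)


def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # NER precision and recall for BERTomelo Base, taken from the table above.
    print(f"{f1_from_precision_recall(0.9046, 0.9229):.4f}")
```

Assuming the reported NER F1 is the harmonic mean of the reported precision and recall, plugging in the BERTomelo Base row (90.46% precision, 92.29% recall) reproduces the listed 91.37% within rounding.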
Limitations
- The evaluation is limited by the lack of standardized long-context benchmarks designed specifically for Portuguese.
- The current release does not yet incorporate the sequence packing mechanism, which may affect token throughput in specific training scenarios.
- As with all models trained on large-scale web data, BERTomelo may reflect biases present in the ClassiCC-PT dataset.
Acknowledgments
This work was supported in part by Advanced Micro Devices, Inc. under the AMD AI & HPC Cluster Program. We also thank the authors of the ClassiCC-PT, ASSIN2, and LeNER-BR datasets for making them available, and the developers of ModBERTBr for their foundational work in adapting this architecture to Brazilian Portuguese.