BERTomelo

BERTomelo is a BERT-style encoder built on the ModernBERT architecture. It is pretrained from scratch on the Classified Common Crawl Corpus for Portuguese (ClassiCC-PT) and is specialized for the Portuguese language. BERTomelo aims to fill the gap left by older Brazilian BERT models by offering native long-context capability, which makes it well suited to tasks that require processing extensive documents while retaining the efficiency of the ModernBERT reference architecture.

Architectural Improvements

BERTomelo leverages several recent advancements in transformer design integrated into the ModernBERT architecture, such as:

  • Rotary Positional Embeddings (RoPE): Facilitate robust long-context support by encoding positional information through rotation, allowing better extrapolation to longer sequences.
  • Local-Global Alternating Attention: Optimizes computational efficiency by alternating between local attention (focusing on neighboring tokens) and global attention (capturing dependencies across the entire sequence).
  • Unpadding and FlashAttention 2: Enhance hardware-level efficiency by removing unnecessary padding tokens and using an IO-aware attention implementation to maximize token throughput.
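The rotation idea behind RoPE can be illustrated with a minimal, dependency-free sketch. It uses the adjacent-pair formulation and a base of 10000, both of which are assumptions made for illustration; production implementations (including ModernBERT's) may pair dimensions differently and use other base values. The helper names `rope_rotate` and `dot` are hypothetical. The sketch shows the property that makes RoPE attractive: the query-key score depends only on the distance between the two positions.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, dim, 2):
        # Each pair has its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    """Plain dot product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (made-up values).
query = [0.3, -1.2, 0.7, 0.5]
key = [1.0, 0.4, -0.6, 0.9]

# Both pairs of positions are 3 tokens apart, one near the start of the
# sequence and one much further along.
score_near_start = dot(rope_rotate(query, 5), rope_rotate(key, 2))
score_far_along = dot(rope_rotate(query, 905), rope_rotate(key, 902))
print(abs(score_near_start - score_far_along) < 1e-9)
```

Because the score is a function of relative offset only, the same learned attention pattern applies wherever a pair of tokens appears in the sequence.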

BERTomelo was pretrained on 386.48 billion tokens. Its tokenizer comes from ModBERTBr and was trained with the Unigram algorithm on the BrWaC corpus and the Wikipedia Portuguese dataset.

Available Models

| Model Variant | Layers | Parameters | Context Window |
|---|---|---|---|
| BERTomelo Base | 22 | 136 million | 1,024 tokens |
| BERTomelo Large | 28 | 377 million | 1,024 tokens |

Usage

BERTomelo works with the transformers library from version 4.48.0 onward:

pip install -U "transformers>=4.48.0"

Because BERTomelo is pretrained with a masked language modeling (MLM) objective, it can be used directly for inference via AutoModelForMaskedLM or the fill-mask pipeline. It also serves as a robust backbone for fine-tuning on downstream tasks such as classification, textual similarity, retrieval, or question answering.

⚠️ Note: We recommend using FlashAttention 2 to achieve the highest processing efficiency and throughput.
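One way to act on this recommendation is sketched below. It relies on the `attn_implementation` argument of `from_pretrained` in transformers, with `"sdpa"` (PyTorch's built-in scaled dot-product attention) as the fallback. The helper names `pick_attn_implementation` and `load_bertomelo` are hypothetical, and the sketch assumes that FlashAttention 2 is provided by the separate `flash_attn` package and needs a GPU plus half-precision weights.

```python
import importlib.util

def pick_attn_implementation(gpu_available: bool) -> str:
    """Choose a value for the `attn_implementation` argument of `from_pretrained`."""
    # FlashAttention 2 ships in the separate `flash_attn` package.
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    if gpu_available and flash_installed:
        return "flash_attention_2"
    return "sdpa"  # PyTorch's built-in scaled dot-product attention

def load_bertomelo(model_id: str = "unb-labia/BERTomelo-ModernBERT-Large-v1"):
    """Load the MLM checkpoint with the fastest attention backend available."""
    import torch
    from transformers import AutoModelForMaskedLM

    gpu_available = torch.cuda.is_available()
    attn = pick_attn_implementation(gpu_available)
    # FlashAttention 2 runs only in half precision, so cast when it is selected.
    dtype = torch.bfloat16 if attn == "flash_attention_2" else torch.float32
    model = AutoModelForMaskedLM.from_pretrained(
        model_id, attn_implementation=attn, torch_dtype=dtype
    )
    return model.to("cuda") if gpu_available else model
```

Calling `load_bertomelo()` downloads the checkpoint on first use. On a machine without a GPU or without `flash_attn`, the sketch falls back to `"sdpa"` with full-precision weights.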

Using AutoModelForMaskedLM:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "unb-labia/BERTomelo-ModernBERT-Large-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "O cachorro mais famoso do Brasil é o vira [MASK]."
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the highest-scoring vocabulary entry there.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1).item()
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)

Using a pipeline:

from transformers import pipeline

# Build a fill-mask pipeline backed by the BERTomelo Large checkpoint.
pipe = pipeline(
    "fill-mask",
    model="unb-labia/BERTomelo-ModernBERT-Large-v1",
)
input_text = "O Brasil está presente na [MASK] do sul."
results = pipe(input_text)

# Each result holds a candidate token and its score.
for result in results:
    print(f"Token: {result['token_str']} | Score: {result['score']:.4f}")

Note: Unlike traditional BERT models, the ModernBERT architecture utilized by BERTomelo does not require token type IDs. You can omit this parameter in most downstream tasks.

Evaluation

The following table summarizes the performance of BERTomelo compared to other prominent models for Brazilian Portuguese across Semantic Textual Similarity (STS), Recognizing Textual Entailment (RTE), and Named Entity Recognition (NER).

| Model | STS MSE ↓ | STS Pearson ↑ | RTE F1 ↑ | RTE Acc ↑ | NER F1 ↑ | NER Recall ↑ | NER Prec ↑ |
|---|---|---|---|---|---|---|---|
| mBERT | 0.597 | 0.801 | 84.45% | 84.52% | 88.50% | 90.45% | 86.63% |
| BERTimbau | 0.580 | 0.836 | 89.20% | 89.20% | 90.48% | 91.64% | 89.35% |
| BERTuguês | 0.583 | 0.823 | 86.27% | 86.40% | 89.56% | 90.59% | 88.55% |
| ModernBERT | 0.514 | 0.790 | 81.09% | 81.17% | 75.76% | 78.03% | 73.61% |
| ModBERTBr | 0.509 | 0.812 | 85.28% | 85.42% | 90.08% | 92.40% | 87.88% |
| BERTomelo Base | 0.425 | 0.833 | 87.65% | 87.70% | 91.37% | 92.29% | 90.46% |
| BERTimbau Large | 0.500 | 0.852 | 90.04% | 90.04% | 90.54% | 92.24% | 88.90% |
| Albertina 900M | 0.570 | 0.853 | 89.09% | 89.09% | 87.28% | 91.43% | 83.49% |
| BERTomelo Large | 0.401 | 0.849 | 89.21% | 89.26% | 91.05% | 92.75% | 89.42% |
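The two STS columns measure different things, which is why a model can lead on one and not the other (BERTomelo Large has the lowest MSE in the table, while Albertina 900M has a slightly higher Pearson score). The sketch below shows how the two metrics are commonly computed. It is not the evaluation script used for these results, and the gold and predicted similarity scores are made-up values.

```python
import math

def mean_squared_error(gold, pred):
    """Average squared gap between gold and predicted scores (lower is better)."""
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)

def pearson(gold, pred):
    """Linear correlation between gold and predicted scores (higher is better)."""
    n = len(gold)
    mean_g, mean_p = sum(gold) / n, sum(pred) / n
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, pred))
    var_g = sum((g - mean_g) ** 2 for g in gold)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / math.sqrt(var_g * var_p)

# Hypothetical similarity scores on a 1-5 scale.
gold_scores = [1.0, 2.5, 4.0, 5.0]
pred_scores = [1.5, 2.0, 4.5, 4.5]

# Adding a constant offset to every prediction changes MSE but not Pearson.
shifted_scores = [p + 1.0 for p in pred_scores]

print("MSE:", mean_squared_error(gold_scores, pred_scores))
print("Pearson:", round(pearson(gold_scores, pred_scores), 4))
print("MSE (shifted):", mean_squared_error(gold_scores, shifted_scores))
print("Pearson (shifted):", round(pearson(gold_scores, shifted_scores), 4))
```

MSE penalizes any absolute deviation from the gold score, while Pearson only checks whether predictions rise and fall with the gold scores, so the two rankings need not agree.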

Limitations

  • The evaluation does not cover long-context performance, because standardized long-context benchmarks designed for the Portuguese language are not available.
  • The current release does not yet incorporate the sequence packing mechanism, which may affect token throughput in specific training scenarios.
  • As with all models trained on large-scale web data, BERTomelo may reflect biases present in the ClassiCC-PT dataset.

Acknowledgments

This work was supported in part by Advanced Micro Devices, Inc. under the AMD AI & HPC Cluster Program. We also thank the authors of the ClassiCC-PT, ASSIN2, and LeNER-BR datasets for making them available, and the developers of ModBERTBr for their foundational work in adapting this architecture to Brazilian Portuguese.
