BERTomelo

BERTomelo is a family of BERT-style encoders built on the ModernBERT architecture. It is pretrained from scratch on the Classified Common Crawl Corpus for Portuguese (ClassiCC-PT) and is tailored specifically to Portuguese. BERTomelo aims to close the gap left by older Brazilian BERT models by offering native long-context capability, which makes it well suited to tasks that process long documents while retaining the efficiency of the reference ModernBERT architecture.

Architectural Improvements

BERTomelo leverages several recent advancements in transformer design integrated into the ModernBERT architecture, such as:

Rotary Positional Embeddings (RoPE): Facilitates robust long-context support by encoding positional information through rotation, allowing for better extrapolation to larger sequence lengths.
Local-Global Alternating Attention: Optimizes computational efficiency by alternating between local attention (focusing on neighboring tokens) and global attention (capturing dependencies across the entire sequence).
Unpadding and FlashAttention 2: Enhances hardware-level efficiency by removing unnecessary padding tokens and using an IO-aware attention implementation to maximize token throughput.
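To make the alternating pattern concrete, the sketch below builds boolean attention masks in plain Python. The window size, the number of layers shown, and the choice of one global layer in every three are illustrative assumptions, not BERTomelo's confirmed configuration, and the function names are hypothetical.

```python
def local_attention_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    Each token sees only neighbors within window // 2 positions on either side.
    """
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]


def global_attention_mask(seq_len):
    """Every token may attend to every other token."""
    return [[True] * seq_len for _ in range(seq_len)]


def layer_kind(layer_idx, global_every):
    """Assumed schedule: one global layer every `global_every` layers."""
    return "global" if layer_idx % global_every == 0 else "local"


# Toy values chosen for readability only.
seq_len, window, global_every = 8, 4, 3

pattern = [layer_kind(i, global_every) for i in range(6)]
print(pattern)

local = local_attention_mask(seq_len, window)
full = global_attention_mask(seq_len)

# Count attended positions per layer type: local cost grows with the window,
# global cost grows with the square of the sequence length.
local_pairs = sum(sum(row) for row in local)
global_pairs = sum(sum(row) for row in full)
print(local_pairs, global_pairs)
```

With these toy values the local layers score far fewer query-key pairs than the global ones, which is where the efficiency gain on long sequences comes from.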

BERTomelo was pretrained on 386.48 billion tokens. Its custom tokenizer comes from ModBERTBr and was trained with the Unigram algorithm on the BrWaC corpus and the Portuguese Wikipedia dataset.

Available Models

Model Variant	Layers	Parameters	Context Window
BERTomelo Base	22	136 Million	1,024 tokens
BERTomelo Large	28	377 Million	1,024 tokens

Usage

BERTomelo is supported by the transformers library from version 4.48.0 onward:

pip install -U "transformers>=4.48.0"

Since BERTomelo is pretrained with a Masked Language Model (MLM) objective, it can be used directly for inference via AutoModelForMaskedLM or the fill-mask pipeline. It also serves as a backbone for fine-tuning on downstream tasks such as classification, textual similarity, retrieval, or question answering.

⚠️ Note: We recommend using FlashAttention 2 to achieve the highest processing efficiency and throughput.
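One way to follow this recommendation is sketched below. It assumes the separate flash-attn package has been installed and that a CUDA GPU is available; otherwise it falls back to PyTorch's SDPA attention. The helper names pick_attn_implementation and load_bertomelo are hypothetical, introduced only for this illustration.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" when flash_attn is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


def load_bertomelo(model_id: str = "unb-labia/BERTomelo-ModernBERT-Large-v1"):
    """Load tokenizer and MLM model (downloads the weights on first call)."""
    # Imported lazily so the helper above works without torch installed.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # FlashAttention 2 needs a GPU and half-precision weights.
    use_flash = (
        pick_attn_implementation() == "flash_attention_2"
        and torch.cuda.is_available()
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(
        model_id,
        attn_implementation="flash_attention_2" if use_flash else "sdpa",
        torch_dtype=torch.bfloat16 if use_flash else torch.float32,
    )
    return tokenizer, model.to("cuda" if use_flash else "cpu")


print(pick_attn_implementation())
```

Calling load_bertomelo() returns a tokenizer and model that can be used in place of the ones created in the examples that follow.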

Using AutoModelForMaskedLM:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "unb-labia/BERTomelo-ModernBERT-Large-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "O cachorro mais famoso do Brasil é o vira [MASK]."
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the highest-scoring vocabulary entry there.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)

Using a pipeline:

from transformers import pipeline

pipe = pipeline(
    "fill-mask",
    model="unb-labia/BERTomelo-ModernBERT-Large-v1",
)
input_text = "O Brasil está presente na [MASK] do sul."
results = pipe(input_text)

for result in results:
    print(f"Token: {result['token_str']} | Score: {result['score']:.4f}")

Note: Unlike traditional BERT models, the ModernBERT architecture used by BERTomelo does not require token type IDs, so you can omit them in most downstream tasks.
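For the fine-tuning use case, a minimal sketch of attaching a classification head and encoding a sentence pair without token type IDs might look like the following. The label set, the helper names build_classifier and encode_pair, and the two-class task are assumptions made for illustration; the classification head is freshly initialized and must be trained before its predictions mean anything.

```python
# Hypothetical label set for a two-class entailment task.
LABELS = ["none", "entailment"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}


def build_classifier(model_id: str = "unb-labia/BERTomelo-ModernBERT-Large-v1"):
    """Load the pretrained encoder with an untrained classification head."""
    # Imported lazily; the first call downloads the weights.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(LABELS),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model


def encode_pair(tokenizer, premise: str, hypothesis: str):
    """Tokenize a sentence pair without requesting token type IDs."""
    return tokenizer(
        premise,
        hypothesis,
        truncation=True,
        max_length=1024,  # context window listed in the model table
        return_token_type_ids=False,
        return_tensors="pt",
    )


print(label2id)
```

The output of encode_pair can be passed to the model as model(**encoded); training the head itself (for example with the transformers Trainer) is outside the scope of this sketch.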

Evaluation

The following table summarizes the performance of BERTomelo compared to other prominent models for Brazilian Portuguese across Semantic Textual Similarity (STS), Recognizing Textual Entailment (RTE), and Named Entity Recognition (NER).

Model	STS MSE ↓	STS Pearson ↑	RTE F1 ↑	RTE Acc ↑	NER F1 ↑	NER Recall ↑	NER Prec ↑
mBERT	0.597	0.801	84.45%	84.52%	88.50%	90.45%	86.63%
BERTimbau	0.580	0.836	89.20%	89.20%	90.48%	91.64%	89.35%
BERTuguês	0.583	0.823	86.27%	86.40%	89.56%	90.59%	88.55%
ModernBERT	0.514	0.790	81.09%	81.17%	75.76%	78.03%	73.61%
ModBERTBr	0.509	0.812	85.28%	85.42%	90.08%	92.40%	87.88%
BERTomelo Base	0.425	0.833	87.65%	87.70%	91.37%	92.29%	90.46%
BERTimbau Large	0.500	0.852	90.04%	90.04%	90.54%	92.24%	88.90%
Albertina 900M	0.570	0.853	89.09%	89.09%	87.28%	91.43%	83.49%
BERTomelo Large	0.401	0.849	89.21%	89.26%	91.05%	92.75%	89.42%
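The two STS columns measure different things, which is why a model can lead on one and trail on the other: BERTomelo Base has a lower MSE than BERTimbau (0.425 vs. 0.580) but a slightly lower Pearson correlation (0.833 vs. 0.836). The sketch below uses the standard textbook definitions of both metrics on made-up scores; it is not the evaluation script behind the table, and the toy numbers are hypothetical.

```python
import math


def mse(pred, gold):
    """Mean squared error: penalizes absolute distance from the gold scores."""
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold)


def pearson(pred, gold):
    """Pearson correlation: measures linear agreement, ignoring offset and scale."""
    n = len(gold)
    mean_p, mean_g = sum(pred) / n, sum(gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    std_g = math.sqrt(sum((g - mean_g) ** 2 for g in gold))
    return cov / (std_p * std_g)


# Hypothetical similarity scores on a 1-5 scale.
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted = [g + 0.5 for g in gold]  # right ordering, constant offset

shifted_mse = mse(shifted, gold)
shifted_r = pearson(shifted, gold)
print(shifted_mse, shifted_r)
```

Predictions that are consistently offset keep a perfect correlation while still incurring squared error, so the two columns are best read together.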

Limitations

The evaluation is constrained by the lack of standardized long-context benchmarks designed for Portuguese.
The current release does not yet include sequence packing, which may reduce token throughput in some training scenarios.
As with all models trained on large-scale web data, BERTomelo may reflect biases present in the ClassiCC-PT dataset.

Acknowledgments

This work was supported in part by Advanced Micro Devices, Inc. under the AMD AI & HPC Cluster Program. We also thank the authors of the ClassiCC-PT, ASSIN2, and LeNER-BR datasets, as well as the developers of ModBERTBr for their foundational work in adapting this architecture to Brazilian Portuguese.
