
tifilBERT-Base: Highly Efficient Turkish Sentence Encoder

tifilBERT-Base is a highly efficient Turkish sentence encoder and feature extractor developed by Şuayp Talha Kocabay under the TÜBİTAK Science High School AI Club (TFLai).

Designed to overcome the "anisotropy" (vector collapse) problem common in standard Masked Language Models (MLM), tifilBERT was engineered with a specific goal: To create a perfectly distributed, highly meaningful vector space for the Turkish language. Unlike traditional models that are pre-trained from scratch specifically for Turkish (requiring massive GPU clusters and hundreds of compute hours), tifilBERT leverages an efficient cross-lingual transfer learning strategy. By building upon the robust foundation of mmBERT-small, tifilBERT achieved its remarkable performance using only 8 GPU hours of language adaptation on a single NVIDIA B200, followed by a 10-minute contrastive alignment phase on an A100. This approach proves that strategic knowledge transfer can fiercely compete with, and often outperform, models trained from scratch.

🏆 Benchmarks

Linear Probing Across 11 Benchmarks

Evaluated using the highly efficient encoder-fast-eval framework across 11 diverse Turkish NLP tasks (Linear Probing), tifilBERT showcased extraordinary semantic alignment and reasoning capabilities.

Despite its extremely constrained compute budget (a fraction of the resources used by its competitors), tifilBERT massively outperforms its base architecture (mmBERT-small) and fiercely competes with heavily scaled, trillion-token models which are trained from scratch specifically for Turkish.

| Task / Dataset | Metric | mmBERT-small | ytu-ce-cosmos/modernbert-tr-base-1k | TFLai/tifilBERT-Base | boun-tabilab/TabiBERT |
|---|---|---|---|---|---|
| turkish-wikiNER | F1 | 0.4851 | 0.6000 | 0.5679 | 0.3939 |
| Beyazperde Reviews | R2 | 0.2849 | 0.5174 | 0.4823 | 0.3287 |
| SentiTurca (E-commerce) | F1 | 0.5005 | 0.6091 | 0.5835 | 0.5225 |
| SentiTurca (Hate) | F1 | 0.3633 | 0.4949 | 0.4746 | 0.4420 |
| SentiTurca (Movies) | F1 | 0.7181 | 0.8043 | 0.7848 | 0.7570 |
| TrCOLA | F1 | 0.5767 | 0.6876 | 0.5155 | 0.6380 |
| TurkishHateMap | F1 | 0.3512 | 0.4706 | 0.4496 | 0.4320 |
| BuyukSinema | F1 | 0.1743 | 0.2434 | 0.2331 | 0.1596 |
| Sinefil Reviews | R2 | 0.1600 | 0.3214 | 0.3034 | 0.1734 |
| Vitamins/Supplements Reviews | R2 | 0.3550 | 0.5693 | 0.5749 | 0.4063 |
| Offenseval2020_tr | F1 | 0.6233 | 0.7879 | 0.7086 | 0.6684 |
| **Average** | - | 0.4175 | 0.5551 | 0.5161 | 0.4474 |
  • Massive Leap from Base: tifilBERT demonstrates a gigantic leap from its foundational mmBERT-small architecture on nearly every metric (TrCOLA being the lone exception), almost doubling the R2 performance on regression tasks like Beyazperde and Sinefil.
  • Outperforming Established Domain Models: When evaluated with frozen representations (Linear Probing), tifilBERT comfortably surpasses the well-established TabiBERT on average (0.5161 vs 0.4474). It shows dramatic superiority in tasks requiring granular semantic separation, such as regression mapping for reviews (e.g., Sinefil: 0.3034 vs 0.1734) and NER (0.5679 vs 0.3939).
  • Punching Above its Weight Class: In continuous value prediction tasks (Regression), tifilBERT's mathematically aligned vector space shines brilliantly. It surpasses models trained from scratch on massive Turkish corpora (like ytu-ce-cosmos/modernbert-tr-base-1k) on the Vitamins/Supplements dataset (0.5749 vs 0.5693), proving the raw power of Supervised Contrastive Learning in adapting pre-trained multilingual knowledge to specific target languages.
  • Semantics Over Syntax: The only benchmark where TabiBERT holds a clear lead over tifilBERT is TrCOLA (Linguistic Acceptability). This perfectly aligns with our architectural choices: tifilBERT heavily prioritizes extracting deep conceptual meaning over checking rigid syntactic grammar, a direct and intended result of its contrastive alignment phase.

TabiBench Results Across 14 Subtasks

While the benchmarks in the previous section utilized Linear Probing to test frozen embeddings, the TabiBench suite evaluates the model's performance under Full Fine-Tuning conditions. This provides a clear picture of tifilBERT's adaptability as a pre-trained backbone when all of its weights are optimized for a specific Turkish NLP challenge.

🎯 General Summary Table (Comparison of All Task Groups)

A collective view of the weighted averages of all models across the 5 main task groups:

| Task Group | Metric | tifilBERT-Base | BERTurk | TabiBERT |
|---|---|---|---|---|
| Set 1 (Medical/Academic) | Macro F1 | 69.70% | 70.90% | 71.91% |
| Set 2 (NLI) | Macro F1 | 83.87% | 84.33% | 84.51% |
| Set 3 (QA) | F1 Score | 77.14% | 60.16% | 69.71% |
| Set 4 (STS) | Pearson | 86.00% | 85.33% | 84.75% |
| Set 5 (General Classification) | Macro F1 | 84.17% | 83.42% | 83.44% |

Set 1 (Academic / Medical Text Classification) - Metric: Macro F1

| Dataset | Number of Samples | tifilBERT-Base | BERTurk | TabiBERT |
|---|---|---|---|---|
| PubMed RCT | 1,500 | 0.7416 | 0.7561 | 0.7532 |
| Sci-Cite TR | 1,559 | 0.8077 | 0.8160 | 0.8329 |
| Thesis Abstract | 1,683 | 0.4885 | 0.4920 | 0.5077 |
| MedNLI | 1,422 | 0.7752 | 0.7990 | 0.8085 |
| **Weighted Average** | 6,164 | 0.6970 | 0.7090 | 0.7191 |

Set 2 (Natural Language Inference - NLI) - Metric: Macro F1

| Dataset | Number of Samples | tifilBERT-Base | BERTurk | TabiBERT |
|---|---|---|---|---|
| MultiNLI | 4,923 | 0.7944 | 0.7857 | 0.8060 |
| SNLI | 9,824 | 0.8608 | 0.8721 | 0.8647 |
| **Weighted Average** | 14,747 | 0.8387 | 0.8433 | 0.8451 |

Set 3 (Question Answering - QA) - Metric: F1 Score

| Dataset | Number of Samples | tifilBERT-Base | BERTurk | TabiBERT |
|---|---|---|---|---|
| TQuAD | 2,520 | 0.8011 | 0.6330 | 0.7234 |
| XQuAD | 179 | 0.3514 | 0.1596 | 0.3261 |
| **Weighted Average** | 2,699 | 0.7714 | 0.6016 | 0.6971 |

Set 4 (Semantic Textual Similarity - STS) - Metric: Pearson Correlation

| Dataset | Number of Samples | tifilBERT-Base | BERTurk | TabiBERT |
|---|---|---|---|---|
| SICK-TR | 4,927 | 0.8584 | 0.8595 | 0.8500 |
| STSb-TR | 1,379 | 0.8654 | 0.8312 | 0.8384 |
| **Weighted Average** | 6,306 | 0.8600 | 0.8533 | 0.8475 |

Set 5 (General Text Classification) - Metric: Macro F1

| Dataset | Number of Samples | tifilBERT-Base | BERTurk | TabiBERT |
|---|---|---|---|---|
| NewsCat | 250 | 0.9598 | 0.9560 | 0.9520 |
| BilTweetNews | 150 | 0.4804 | 0.5787 | 0.5011 |
| GenderHateSpeech | 2,000 | 0.6808 | 0.6825 | 0.6901 |
| ProductReviews | 35,275 | 0.8515 | 0.8430 | 0.8432 |
| **Weighted Average** | 37,675 | 0.8417 | 0.8342 | 0.8344 |

💡 Key Insights & Takeaways from TabiBench

The Full Fine-Tuning results on TabiBench perfectly illustrate the success of tifilBERT's non-traditional training pipeline:

  • Dominance in Complex Reasoning (QA & STS): tifilBERT achieves a massive lead in Question Answering (+7.43% over TabiBERT) and Semantic Textual Similarity (+1.25% over TabiBERT). This directly validates our Phase 2 (Supervised Contrastive Learning) strategy. By explicitly destroying vector space anisotropy, the model captures profound semantic nuances that standard MLM models miss, making it exceptionally powerful for retrieval and reasoning tasks.
  • Highly Competitive in Domain-Specific Tasks: Even in highly specialized domains like Medical/Academic texts (Set 1) and NLI (Set 2), tifilBERT performs neck-and-neck with heavily scaled, trillion-token models specifically trained from scratch for Turkish.
  • The Power of Transfer Learning: The ultimate takeaway is the efficiency of cross-lingual adaptation. tifilBERT demonstrates that instead of pre-training a language model from scratch (which requires massive datasets and thousands of GPU hours), applying rigorous mathematical alignment and strategic cross-lingual knowledge transfer to a strong base model in under 9 hours can yield representations that fiercely rival, and in many domains surpass, natively trained Turkish models.

📌 A Note on Reproducibility & Evaluation Dynamics

Please note that when reproducing these downstream tasks or fine-tuning the model for your own use cases, minor variations in the final metrics are completely normal and expected. Differences in fine-tuning hyperparameters (like learning rate and batch size), random seed initializations, and specific hardware training dynamics can lead to slight fluctuations.
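To keep the seed-driven part of that variation under control when reproducing results, a common pattern is to pin every random number generator up front (hardware-level nondeterminism can still cause small drift):

```python
# Pin all relevant RNGs before fine-tuning. This reduces, but does not
# eliminate, run-to-run variation; GPU kernel nondeterminism remains.
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines

set_seed(42)
```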


🧠 Training Methodology (Efficient Cross-Lingual Transfer)

Instead of relying on sheer compute power, tifilBERT was trained using a highly strategic, two-phase pipeline to maximize hardware efficiency:

Phase 1: Continual Pre-Training & Cross-Lingual Transfer (TLM + MLM)

  • Hardware Compute (Adaptation Phase): ~8 Hours on 1x NVIDIA B200 (excluding the original pre-training compute of the base mmBERT-small model).
  • Base Architecture: Initialized from the highly capable mmBERT-small.
  • Objective: Transfer the rich, conceptual English intelligence of the base model into Turkish.
  • Data: A massive 2.8M row interleaved dataset combining Diffutron Pretraining Corpus (Pure Turkish Wikipedia/News/Web, 70%) and OPUS-100 (English-Turkish Translation Pairs, 30%).
  • Mechanism: By using Translation Language Modeling (TLM) with aggressive random masking, the model was forced to build deep mathematical bridges between English and Turkish concepts.
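A minimal sketch of the TLM idea: a translation pair is joined into one sequence and masked jointly, so the model can recover masked tokens in one language from context in the other. The separator handling and the 30% masking rate below are illustrative assumptions, not the exact Phase 1 configuration.

```python
# TLM sketch: concatenate an English-Turkish pair, then mask across BOTH
# languages with the standard MLM collator. Masking rate is illustrative.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("TFLai/tifilBERT-Base")

en = "Artificial intelligence models are getting smaller and faster."
tr = "Yapay zeka modelleri küçülüyor ve hızlanıyor."

# Join the pair with whatever separator the tokenizer provides
sep = tokenizer.sep_token or tokenizer.eos_token or " "
enc = tokenizer(f"{en} {sep} {tr}")

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.3)
batch = collator([enc])  # labels hold the original ids at masked positions
print(batch["input_ids"].shape, batch["labels"].shape)
```

Because some masked Turkish tokens are only recoverable from the English half (and vice versa), gradient pressure forces shared representations across the two languages.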

Phase 2: Anisotropy Destruction (Supervised Contrastive Learning)

  • Hardware Compute: ~10 Minutes on 1x NVIDIA A100 (80GB).
  • Objective: Standard MLM models tend to cluster sentence vectors too closely. We needed to shatter this space to create distinct semantic boundaries.
  • Data: emrecan/all-nli-tr (Anchor, Positive, Negative Triplets).
  • Mechanism: Trained using Multiple Negatives Ranking Loss (MNRL) with a massive batch size of 512. In a single forward pass, the model pushed one anchor sentence towards its positive pair while simultaneously repelling it from 511 negative sentences.
  • Result: A perfectly crystallized vector space where semantic differences (sentiment, intent, logical contradiction) are separated by massive distances, making the embeddings incredibly robust.

💻 Usage

tifilBERT is primarily designed for generating highly meaningful sentence embeddings, but its perfectly aligned vector space makes it an exceptional backbone for any NLP task.

Option 1: Sentence Transformers (Recommended for Embeddings/RAG)

```bash
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("TFLai/tifilBERT-Base")

# Sentences we want to encode
sentences = [
    "Hakem maçı mükemmel yönetti, kararları çok isabetliydi.",
    "Hakem maçı mükemmel katletti, tüm kararları taraflıydı.",
    "Yarın hava güneşli olacakmış."
]

# Generate embeddings
embeddings = model.encode(sentences)

# Calculate similarities (Cosine Similarity)
similarity_1_2 = model.similarity(embeddings[0], embeddings[1])
similarity_1_3 = model.similarity(embeddings[0], embeddings[2])

print(f"Similarity (Positive vs Negative Review): {similarity_1_2.item():.4f}")
print(f"Similarity (Review vs Weather): {similarity_1_3.item():.4f}")
```

Option 2: Fine-Tuning for Downstream Tasks (Classification, Regression, NER)

Beyond zero-shot feature extraction and Linear Probing, tifilBERT serves as an exceptionally strong, pre-aligned backbone for full fine-tuning. You can attach task-specific heads (using AutoModelForSequenceClassification or AutoModelForTokenClassification) and fine-tune the model for Text Classification (sentiment, hate speech), Token Classification (NER, POS), and Regression tasks. Because the vector space is already highly organized, the model requires significantly fewer epochs to converge during fine-tuning compared to standard MLM models.
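A hedged sketch of attaching a classification head as described above. `num_labels=2` is an illustrative choice; the head's weights are freshly initialized and would be learned during fine-tuning:

```python
# Attach a sequence-classification head to the pre-aligned backbone.
# num_labels=2 is illustrative (e.g., binary sentiment); the head is untrained.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("TFLai/tifilBERT-Base")
model = AutoModelForSequenceClassification.from_pretrained(
    "TFLai/tifilBERT-Base", num_labels=2
)

inputs = tokenizer("Bu ürün beklentimi fazlasıyla karşıladı.",
                   return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # one logit per label
```

From here, the model can be passed to the Hugging Face `Trainer` (or a plain training loop) exactly as with any other encoder backbone.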

Option 3: Hugging Face Transformers (Raw Access)

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("TFLai/tifilBERT-Base")
model = AutoModel.from_pretrained("TFLai/tifilBERT-Base")

sentences = ["Yapay zeka modelleri gün geçtikçe küçülüyor ve hızlanıyor."]

# Tokenize
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Get hidden states
with torch.no_grad():
    outputs = model(**inputs)

# Extract [CLS] token embedding for sentence representation
sentence_embedding = outputs.last_hidden_state[:, 0, :]
print(sentence_embedding.shape) # Output: torch.Size([1, hidden_size])
```
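Mask-aware mean pooling is a common alternative to the [CLS] token when working with raw hidden states; whether it matches the pooling used by the Sentence Transformers wrapper should be checked against the model's pooling configuration. A self-contained sketch:

```python
# Alternative pooling: average token embeddings, excluding padding positions.
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("TFLai/tifilBERT-Base")
model = AutoModel.from_pretrained("TFLai/tifilBERT-Base")

inputs = tokenizer(["Kısa bir örnek cümle."], return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

def mean_pooling(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    mask = mask.unsqueeze(-1).float()          # broadcast over hidden dim
    return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

emb = mean_pooling(out.last_hidden_state, inputs["attention_mask"])
print(emb.shape)
```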

⚠️ Limitations & Bias

  • Context Length: The model is highly optimized for short-to-medium text spans (max 512 tokens). For extremely long documents (e.g., full academic papers or books), models natively trained on ultra-long contexts might be more suitable.
  • Syntactic vs. Semantic Focus: Due to the heavy Contrastive Learning phase, the model prioritizes deep semantic meaning over rigid grammatical syntax (e.g., Linguistic Acceptability tasks).
  • Training Data Bias: As the model utilizes Wikipedia, news, and internet datasets, it may inherit the social, cultural, and political biases present in the Turkish web corpus.

🏫 Institution & Acknowledgements

  • Institution: TÜBİTAK Fen Lisesi (TÜBİTAK Science High School) - TFLai (Artificial Intelligence Club)
  • Special Thanks & Citations: This model was made possible thanks to the incredible open-source contributions of the NLP community. We extend our deepest gratitude to:
    • Boğaziçi University TABİ Lab for their inspirational work and for providing the comprehensive TabiBench evaluation suite.
    • mrbesher for the highly efficient encoder-fast-eval benchmark framework used to evaluate this model.
    • Diffutron for supplying the massive, high-quality Turkish pre-training corpus used in Phase 1.
    • Helsinki-NLP for the OPUS-100 English-Turkish translation pairs that enabled our cross-lingual transfer strategy.
    • emrecan for curating the all-nli-tr triplet dataset, which was the absolute cornerstone of our Contrastive Learning phase.

📝 Citation

If you use this model in your research or projects, please cite it as follows:

```bibtex
@misc{kocabay2026tifilbert,
  title={tifilBERT: Highly Efficient Turkish Sentence Encoder via Contrastive Transfer Learning},
  author={Kocabay, {\c{S}}uayp Talha and TFLai},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/TFLai/tifilBERT-Base}}
}
```