# polish-roberta-base-8k
This is a small model trained by knowledge distillation from polish-roberta-8k. For distillation, we combined several loss functions: KL divergence on the MLM head, plus MSE and cosine losses to align the last-layer representations of the teacher and the student. The model was trained for one epoch on the Polish subset of the FineWeb 2 dataset. We prepared a packed version of the dataset by concatenating documents into sequences of exactly 8192 tokens, then trained the model with batches of 32 such sequences and a polynomial learning rate scheduler with 1000 warmup iterations and a maximum learning rate of 5e-5.
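The combined distillation objective can be sketched as follows. This is an illustrative implementation, not the training code: the temperature and the equal weighting of the three terms are assumptions, and it further assumes the student and teacher share a hidden size (otherwise a linear projection on the student side would be needed before the MSE and cosine terms).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature=2.0):
    """Sketch of the combined objective: KL divergence on the MLM head
    plus MSE and cosine losses on the last-layer hidden states.
    Temperature and term weights are illustrative assumptions."""
    # KL divergence between softened teacher and student MLM distributions
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # MSE between last-layer representations
    mse = F.mse_loss(student_hidden, teacher_hidden)
    # Cosine loss: pull each student vector towards its teacher counterpart
    cos = 1.0 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()
    return kl + mse + cos
```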
## Evaluation
Evaluation shows that the model compares favorably to other base-sized encoder models (in the 100M-300M parameter range). We conducted fine-tuning experiments on 25 Polish tasks, the same set as for polish-roberta-8k. The table below presents detailed results compared to other popular models: herbert-base-cased and polish-roberta-base-v2.
| TASK TYPE | DOMAIN | METRIC | GROUP | TASK | herbert-base-cased | polish-roberta-base-v2 | polish-roberta-base-8k |
|---|---|---|---|---|---|---|---|
| single-label | mixed | accuracy | KLEJ | NKJP-NER | 94.13 | 94.32 | 94.16 |
| single-label | semantics | accuracy | KLEJ | CDSC-E | 94.22 | 94.05 | 94.54 |
| regression | semantics | spearman | KLEJ | CDSC-R | 93.84 | 94.64 | 94.90 |
| single-label | social media | binary-f1 | KLEJ | CBD | 66.36 | 70.57 | 69.35 |
| single-label | reviews | accuracy | KLEJ | POLEMO2.0-IN | 90.50 | 90.97 | 91.27 |
| single-label | reviews | accuracy | KLEJ | POLEMO2.0-OUT | 77.94 | 79.11 | 81.26 |
| single-label | mixed | binary-f1 | KLEJ | DYK | 68.82 | 70.38 | 69.35 |
| single-label | news | binary-f1 | KLEJ | PSC | 98.94 | 98.88 | 98.90 |
| regression | reviews | 1-wmae | KLEJ | AR | 87.74 | 87.83 | 88.05 |
| single-label | finance | accuracy | FinBench | banking-short | 78.35 | 78.75 | 79.79 |
| single-label | finance | accuracy | FinBench | banking-long | 85.09 | 85.03 | 86.99 |
| single-label | finance | accuracy | FinBench | banking77 | 87.29 | 88.26 | 89.27 |
| regression | finance | r2-score | FinBench | fiqa | 52.50 | 56.63 | 57.31 |
| single-label | finance | accuracy | FinBench | fpb | 83.11 | 83.55 | 83.63 |
| multi-label | finance | weighted-f1 | FinBench | gcn | 94.73 | 95.02 | 94.87 |
| single-label | finance | accuracy | FinBench | stooq | 73.33 | 80.25 | 81.32 |
| single-label | social media | accuracy | Other | 8TAGS | 77.81 | 78.03 | 79.21 |
| single-label | social media | accuracy | Other | BAN-PL | 91.71 | 92.19 | 92.62 |
| multi-label | news | weighted-f1 | Other | MIPD | 57.65 | 58.58 | 64.39 |
| single-label | semantics | accuracy | Other | PPC | 84.16 | 87.05 | 86.02 |
| single-label | semantics | accuracy | Other | SICK-E | 85.17 | 86.71 | 86.31 |
| regression | semantics | spearman | Other | SICK-R | 77.82 | 82.58 | 83.16 |
| multi-label | social media | weighted-f1 | Other | TwitterEMO | 67.41 | 66.46 | 68.75 |
| single-label | reviews | accuracy | Other | IMDB | 90.21 | 91.06 | 95.02 |
| multi-label | law | weighted-f1 | Other | EURLEX | 75.37 | 74.51 | 79.12 |
Table 1. Mean scores over five fine-tuning runs on 25 discriminative tasks in Polish. The evaluation metrics vary across tasks; the metric used for each task is given in the METRIC column.
In addition to the detailed results for the three models shown above, we also present a summary of the evaluation of multilingual models that support Polish.
| MODEL | PARAMS | KLEJ (9 tasks) | FinBench (7 tasks) | Other (9 tasks) | Long tasks (4 tasks) | All tasks (25 tasks) |
|---|---|---|---|---|---|---|
| EuroBERT/EuroBERT-210m | 212M | 77.16 | 76.68 | 72.48 | 80.42 | 75.34 |
| FacebookAI/xlm-roberta-base | 278M | 84.46 | 76.63 | 76.29 | 74.42 | 79.33 |
| jhu-clsp/mmBERT-base | 307M | 83.17 | 76.59 | 80.14 | 82.08 | 80.24 |
| allegro/herbert-base-cased | 124M | 85.83 | 79.20 | 78.59 | 77.08 | 81.37 |
| sdadas/polish-roberta-base-v2 | 124M | 86.75 | 81.07 | 79.69 | 77.30 | 82.62 |
| sdadas/polish-roberta-base-8k | 190M | 86.86 | 81.88 | 81.62 | 81.38 | 83.58 |
Table 2. Comparison of Polish and multilingual models.
## Efficiency
The model includes a custom implementation supporting unpadding and sequence packing, which can significantly speed up inference and training while reducing memory consumption. Using this feature requires Flash Attention and the Transformers library version 5.4 or newer. To enable unpadding, initialize the model with `trust_remote_code=True` and `attn_implementation="flash_attention_2"`, along with 16-bit precision:
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "sdadas/polish-roberta-base-8k",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    dtype=torch.bfloat16,
    device_map="cuda",
)
print(model.__class__.__name__)  # UnpadRobertaModel
```
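Conceptually, unpadding flattens a padded batch into a single packed sequence of real tokens, while recording sequence boundaries as cumulative lengths (the `cu_seqlens` format consumed by Flash Attention's variable-length kernels), so that no compute is spent on padding. The sketch below illustrates the idea and is independent of the model's actual implementation; the helper `unpad` is a hypothetical name introduced here.

```python
import torch

def unpad(input_ids, attention_mask):
    """Flatten a padded batch into one packed token sequence, plus
    cumulative sequence lengths (cu_seqlens) marking where each
    original sequence starts and ends."""
    seqlens = attention_mask.sum(dim=1)        # real tokens per row
    packed = input_ids[attention_mask.bool()]  # drop padding tokens
    cu_seqlens = torch.zeros(len(seqlens) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(seqlens, dim=0)
    return packed, cu_seqlens

ids = torch.tensor([[5, 6, 7, 0], [8, 9, 0, 0]])
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
packed, cu = unpad(ids, mask)
# packed -> tensor([5, 6, 7, 8, 9]); cu -> tensor([0, 3, 5])
```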
## Citation
```
@misc{dadas2026longcontext,
  title={Long-Context Encoder Models for Polish Language Understanding},
  author={Sławomir Dadas and Rafał Poświata and Marek Kozłowski and Małgorzata Grębowiec and Michał Perełkiewicz and Paweł Klimiuk and Przemysław Boruta},
  year={2026},
  eprint={2603.12191},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.12191}
}
```