Latin BERT (Bamman & Burns 2020)

HuggingFace-compatible packaging of the Latin BERT model from:

Bamman, D., & Burns, P.J. (2020). Latin BERT: A Contextual Language Model for Classical Philology. arXiv preprint arXiv:2009.10053.

The original model and training code are available at github.com/dbamman/latin-bert. This repo repackages the same weights for use with HuggingFace transformers.

Note: This is an experimental repackaging. If you encounter any issues, please open a thread in the Discussions tab.

Model Details

  • Architecture: BERT-base (12 layers, 768 hidden, 12 attention heads)
  • Parameters: ~111M
  • Vocab size: 32,900 (SubwordTextEncoder)
  • Max sequence length: 512
  • Training data: Latin texts (see paper for details)

Install

pip install transformers torch

Usage

Basic: Get contextual embeddings

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "latincy/latin-bert", trust_remote_code=True
)
model = AutoModel.from_pretrained("latincy/latin-bert")

inputs = tokenizer("Gallia est omnis divisa in partes tres", return_tensors="pt")
outputs = model(**inputs)

# outputs.last_hidden_state: (batch, seq_len, 768)
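To collapse the per-token embeddings into a single vector per sentence, one common approach (not prescribed by the paper) is attention-mask-weighted mean pooling. A minimal sketch using dummy tensors with the shapes the model produces:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then average over the real tokens only.
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # (batch, 1)
    return summed / counts

# Dummy tensors shaped like the model's output (batch=2, seq_len=10, hidden=768)
hidden = torch.randn(2, 10, 768)
attn = torch.ones(2, 10, dtype=torch.long)
attn[1, 6:] = 0  # second sequence is padded after 6 tokens

emb = mean_pool(hidden, attn)
print(emb.shape)  # torch.Size([2, 768])
```

In practice you would pass `outputs.last_hidden_state` and `inputs["attention_mask"]` from the snippet above.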

Masked language model (fill-mask)

from transformers import AutoTokenizer, BertForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained(
    "latincy/latin-bert", trust_remote_code=True
)
model = BertForMaskedLM.from_pretrained("latincy/latin-bert")

text = "Gallia est omnis [MASK] in partes tres"
inputs = tokenizer(text, return_tensors="pt")
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    logits = model(**inputs).logits

top5 = logits[0, mask_idx, :].topk(5).indices.squeeze()
for token_id in top5:
    print(tokenizer.decode([token_id.item()]))

Custom Tokenizer

The original Latin BERT uses a tensor2tensor SubwordTextEncoder, not standard WordPiece. This repo includes a faithful reimplementation as a HuggingFace PreTrainedTokenizer β€” this is why trust_remote_code=True is required.
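For intuition, subword encoders of this family segment each word by greedy longest-match against the vocabulary. The toy sketch below uses a made-up vocabulary and omits the real SubwordTextEncoder details (character escaping, trailing-underscore word markers):

```python
def greedy_subword(word, vocab, max_piece_len=10):
    """Toy greedy longest-match segmentation (illustration only)."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate piece first, shrinking until a vocab hit.
        for j in range(min(len(word), i + max_piece_len), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

vocab = {"divis", "a", "di", "vis", "part", "es"}
print(greedy_subword("divisa", vocab))  # ['divis', 'a']
print(greedy_subword("partes", vocab))  # ['part', 'es']
```

The actual vocabulary entries and segmentations produced by the repackaged tokenizer will differ; this only illustrates the mechanism.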

Verified against the original case studies from the paper:

POS tagging (Table 1)

Treebank   Accuracy
Perseus    95.2%
PROIEL     98.2%
ITTB       99.2%

Masked word prediction (Table 3)

Metric   Score
P@1      33.1%
P@10     62.2%
P@50     74.0%
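P@k here is the fraction of masked positions whose gold token appears among the model's top-k predictions. A minimal sketch of the metric over dummy logits (the function name is ours, not from the paper's code):

```python
import torch

def precision_at_k(logits, gold_ids, k):
    """Fraction of masked positions whose gold token is in the top-k predictions."""
    topk = logits.topk(k, dim=-1).indices               # (n_masks, k)
    hits = (topk == gold_ids.unsqueeze(-1)).any(dim=-1)  # (n_masks,)
    return hits.float().mean().item()

# Dummy logits for 4 masked positions over a 100-token vocabulary
torch.manual_seed(0)
logits = torch.randn(4, 100)
gold = logits.argmax(dim=-1)  # make gold the top-1 token at every position
print(precision_at_k(logits, gold, 1))  # 1.0
```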

spaCy Integration

Works with spacy-transformers:

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "latincy/latin-bert"

[components.transformer.model.tokenizer_config]
trust_remote_code = true
use_fast = false

Changelog

v1.1.1 β€” Bug fix: add do_lower_case=True to tokenizer

The original Latin BERT vocabulary was trained on lowercased text. All original case studies (POS tagging, WSD, infilling) explicitly called .lower() before tokenizing. The HF PreTrainedTokenizer wrapper was missing this step, causing uppercase characters to be escaped to their ASCII codepoints (e.g. C β†’ \67;), inflating token counts ~4x and producing embeddings the model was never trained on. The tokenizer now lowercases input by default (do_lower_case=True), matching the original pipeline behavior.
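The escaping behavior described above can be illustrated with a toy sketch (the alphabet and escape handling here are simplified assumptions; the real tensor2tensor escape also covers characters like `\` and `_`):

```python
def escape_oov(text, alphabet):
    """Toy illustration: characters outside the subword alphabet are
    replaced by a "\\<codepoint>;" escape sequence."""
    return "".join(ch if ch in alphabet else "\\%d;" % ord(ch) for ch in text)

# Lowercase-only alphabet, as in the Latin BERT vocabulary
alphabet = set("abcdefghijklmnopqrstuvwxyz ")
print(escape_oov("Gallia", alphabet))  # \71;allia -- 'G' (codepoint 67? no: 71) escaped
print(escape_oov("gallia", alphabet))  # gallia -- unchanged once lowercased
```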

v1.1.0 β€” HuggingFace repackaging

Repackaged the original tensor2tensor SubwordTextEncoder tokenizer and PyTorch weights as a HuggingFace PreTrainedTokenizer + safetensors model.

v1.0.0 β€” Original model

Bamman & Burns (2020) Latin BERT weights and tensor2tensor tokenizer.

Citation

@article{bamman2020latin,
  title={Latin BERT: A Contextual Language Model for Classical Philology},
  author={Bamman, David and Burns, Patrick J},
  journal={arXiv preprint arXiv:2009.10053},
  year={2020}
}