LatinCy Stanza (la_stanza_latincy)
A Stanza (Stanford NLP) model suite for Latin trained on harmonized Universal Dependencies treebanks from LatinCy. Provides tokenization, POS tagging, morphological features, lemmatization, dependency parsing, and named entity recognition.
Highlights
- Full NLP pipeline -- tokenizer, POS/morph tagger, lemmatizer, dependency parser, NER
- 6 UD treebanks + LASLA: POS/morph/lemma trained on ~2.87M tokens (UD+LASLA combined)
- Custom character language models trained on 1.6 GB of curated Latin text (13.7M sentences)
- Custom word vectors (CBOW-300, trained on curated Latin corpus)
- NER with 3 entity types: PERSON, LOC, NORP
Quick Start
import stanza
from huggingface_hub import snapshot_download
# Download models (one time)
model_dir = snapshot_download("latincy/la_stanza_latincy")
# Load pipeline
nlp = stanza.Pipeline("la", dir=model_dir, download_method=None)
# Annotate
doc = nlp("Gallia est omnis divisa in partes tres.")
for sent in doc.sentences:
    for word in sent.words:
        print(f"{word.text:12s} {word.upos:6s} {word.lemma:12s} {word.deprel}")
Output:
Gallia PROPN Gallia nsubj:pass
est AUX sum aux:pass
omnis DET omnis det
divisa VERB divido root
in ADP in case
partes NOUN pars obl
tres NUM tres nummod
. PUNCT . punct
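For downstream work it is often handier to collect the annotations as plain records than to print them. The following is a minimal sketch that relies only on the `doc.sentences` / `sent.words` structure and the `text`, `upos`, `lemma`, and `deprel` attributes used in the Quick Start. `doc_to_records` is a hypothetical helper name, and the stand-in document mirrors a few rows of the output above so the snippet runs without downloading the models.

```python
from types import SimpleNamespace


def doc_to_records(doc):
    """Flatten a parsed document into one dict per word."""
    records = []
    for sent_idx, sent in enumerate(doc.sentences):
        for word in sent.words:
            records.append({
                "sentence": sent_idx,
                "text": word.text,
                "upos": word.upos,
                "lemma": word.lemma,
                "deprel": word.deprel,
            })
    return records


def _word(text, upos, lemma, deprel):
    # Stand-in for a Stanza word object (illustration only).
    return SimpleNamespace(text=text, upos=upos, lemma=lemma, deprel=deprel)


# Stand-in document built from rows of the Quick Start output.
demo_doc = SimpleNamespace(sentences=[SimpleNamespace(words=[
    _word("Gallia", "PROPN", "Gallia", "nsubj:pass"),
    _word("est", "AUX", "sum", "aux:pass"),
    _word("divisa", "VERB", "divido", "root"),
])])

records = doc_to_records(demo_doc)
root_lemmas = [r["lemma"] for r in records if r["deprel"] == "root"]
print(root_lemmas)  # ['divido']
```

With a real pipeline, `doc_to_records(nlp(text))` should work the same way, since the helper only reads attributes that the Quick Start already uses.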
NER
nlp = stanza.Pipeline("la", dir=model_dir, download_method=None,
                      processors="tokenize,ner")
doc = nlp("Caesar in Galliam cum legionibus contendit.")
for ent in doc.ents:
    print(f"{ent.text:20s} {ent.type}")
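Entities are often easier to work with when grouped by type. The sketch below assumes only the `text` and `type` attributes used in the loop above. `group_entities` is a hypothetical helper, and the two stand-in entities are illustrative placeholders, not actual model output.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_entities(ents):
    """Map each entity type to its entity strings, in document order."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.type].append(ent.text)
    return dict(grouped)


# Hypothetical stand-ins for doc.ents; NOT real predictions from the model.
demo_ents = [
    SimpleNamespace(text="Caesar", type="PERSON"),
    SimpleNamespace(text="Galliam", type="LOC"),
]

by_type = group_entities(demo_ents)
print(by_type)
```

With a loaded pipeline, the same helper could be called as `group_entities(doc.ents)`.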
Loading from a Local Directory
If you have the models locally (e.g., after cloning the HuggingFace repo):
nlp = stanza.Pipeline("la", dir="/path/to/la_stanza_latincy",
                      download_method=None)
Model Description
| Property | Value |
|---|---|
| Author | Patrick J. Burns / LatinCy |
| Model type | Stanza neural pipeline (BiLSTM-CRF, biaffine parser) |
| Language | Latin |
| License | MIT |
| Total size | ~1.1 GB (8 model files) |
| Framework | Stanza (Stanford NLP) |
Pipeline Components
| Component | Model File | Architecture |
|---|---|---|
| Tokenizer | tokenize/latincy.pt (11 MB) | BiLSTM segmenter |
| POS/Morph | pos/latincy.pt (143 MB) | BiLSTM tagger with CharLM + pretrained vectors |
| Lemmatizer | lemma/latincy.pt (46 MB) | Seq2seq with edit classifier |
| Dep. Parser | depparse/latincy.pt (170 MB) | Deep biaffine attention parser |
| NER | ner/latincy.pt (151 MB) | BiLSTM-CRF with CharLM + pretrained vectors |
| CharLM (fwd) | forward_charlm/latincy.pt (197 MB) | Character-level LSTM language model |
| CharLM (bwd) | backward_charlm/latincy.pt (197 MB) | Character-level LSTM language model |
| Pretrain | pretrain/latincy.pt (174 MB) | Word2Vec CBOW-300 embeddings |
Training Data
POS, Morphology, Lemmatization (UD + LASLA)
Trained on harmonized data from 6 Universal Dependencies Latin treebanks combined with the LASLA corpus (~1.84M tokens of classical Latin with POS, morphological features, and lemmas).
| Treebank | Full Name | Domain |
|---|---|---|
| ITTB | Index Thomisticus Treebank | Scholastic Latin (Thomas Aquinas) |
| LLCT | Late Latin Charter Treebank | Medieval legal charters |
| PROIEL | PROIEL Treebank | Vulgate Bible, historical texts |
| Perseus | Perseus Latin Treebank | Classical Latin (Caesar, Cicero, etc.) |
| UDante | UDante Treebank | Dante Alighieri (De vulgari eloquentia, etc.) |
| CIRCSE | CIRCSE Latin Treebank | LASLA-derived classical texts |
| LASLA | LASLA corpus | Classical Latin (morphology only, no deps) |
Combined: ~2.87M tokens for POS/morph/lemma; ~1.03M tokens (UD only) for tokenizer and dependency parsing.
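The token totals above can be cross-checked with a line of arithmetic. This is a small sketch using the rounded figures stated in this section, so the results are approximate.

```python
# Token counts in millions, as stated in this section (rounded figures).
ud_tokens = 1.03     # six UD treebanks (tokenizer and dependency parsing)
lasla_tokens = 1.84  # LASLA corpus (POS, morphology, lemmas only)

combined = round(ud_tokens + lasla_tokens, 2)
lasla_share = round(100 * lasla_tokens / combined, 1)

print(f"combined: {combined}M tokens, LASLA share: {lasla_share}%")
```

On these figures, LASLA contributes roughly two thirds of the POS/morph/lemma training data.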
NER
Trained on LatinCy NER annotations from 4 sources: 13,493 train / 3,195 dev sentences. Entity types: PERSON (79%), LOC (14%), NORP (7%).
Character Language Models
Trained on 1.6 GB of curated Latin text (13.7M sentences from 9 sources) for 15 epochs. Forward and backward CharLMs provide contextualized character-level features to the POS tagger, lemmatizer, parser, and NER.
Training Procedure
Tokenizer: BiLSTM segmenter trained on UD-only data.
POS/Morph tagger: BiLSTM with CharLM features and pretrained word vectors, trained on UD+LASLA combined data.
Lemmatizer: Seq2seq model with edit classifier, CharLM features, trained on UD+LASLA combined data.
Dependency parser: Deep biaffine attention parser with CharLM features and pretrained word vectors, trained on UD-only data.
NER tagger: BiLSTM-CRF with CharLM features and pretrained word vectors, 8,500 training steps with early stopping.
Evaluation Results
Overall Scores
| Component | Metric | v0.2 (CharLM) | v0.3 (Latin BERT) | Best | Split |
|---|---|---|---|---|---|
| Tokenizer | Token F1 | 98.24 | -- | v0.2 | dev |
| Tokenizer | Sentence F1 | 86.59 | -- | v0.2 | dev |
| POS | UPOS | 97.26 | 97.65 | v0.3 | test |
| POS | XPOS | -- | 97.38 | v0.3 | test |
| POS | UFeats | 92.80 | 93.93 | v0.3 | test |
| POS | AllTags | -- | 92.51 | v0.3 | test |
| Lemma | Accuracy | 97.87 | -- | v0.2 | test |
| Dep. Parse | UAS | 86.95 | 86.20 | v0.2 | test |
| Dep. Parse | LAS | 83.23 | 81.98 | v0.2 | test |
| Dep. Parse | MLAS | 76.96 | 75.23 | v0.2 | test |
| Dep. Parse | BLEX | 79.46 | 78.00 | v0.2 | test |
| NER | Entity F1 | 90.22 | 90.17 | v0.2 | dev |
| NER | PERSON F1 | 93.01 | 93.41 | v0.3 | dev |
| NER | LOC F1 | 80.88 | 79.47 | v0.2 | dev |
| NER | NORP F1 | 78.44 | 76.00 | v0.2 | dev |
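v0.3 added a Latin BERT (Bamman & Burns 2020) transformer backend for POS, which improved both POS metrics that have a v0.2 baseline (UPOS +0.39, UFeats +1.13); XPOS and AllTags were not reported for v0.2. Dependency parsing and overall NER F1 are best with CharLM alone, although PERSON F1 is slightly higher in the v0.3 column.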
v0.3.1 retracts the Latin BERT POS model. Latin BERT ships with a custom fast tokenizer (tokenization_latin_bert_fast.py) that requires trust_remote_code=True. Stanza's bert_embedding.load_tokenizer does not pass that flag, so the BERT POS checkpoint fails to load end-to-end from the published HF repo. v0.3.1 reverts POS to the CharLM backend (numbers match the v0.2 column above). All other components are unchanged from v0.3. A transformer POS will return once the Stanza/Latin BERT integration is resolved.
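The per-metric differences between the v0.2 and v0.3 columns can be recomputed from the Overall Scores table. The sketch below copies only the rows that report both values; the variable names are illustrative.

```python
# (v0.2, v0.3) score pairs copied from the Overall Scores table.
# Rows with a missing value in either column are left out.
scores = {
    "UPOS": (97.26, 97.65),
    "UFeats": (92.80, 93.93),
    "UAS": (86.95, 86.20),
    "LAS": (83.23, 81.98),
    "MLAS": (76.96, 75.23),
    "BLEX": (79.46, 78.00),
    "Entity F1": (90.22, 90.17),
}

# Positive delta means the v0.3 (Latin BERT) column is higher.
deltas = {m: round(v03 - v02, 2) for m, (v02, v03) in scores.items()}
best = {m: ("v0.3" if d > 0 else "v0.2") for m, d in deltas.items()}

for metric, delta in deltas.items():
    print(f"{metric:10s} {delta:+.2f}  best: {best[metric]}")
```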
Cross-Framework Comparison
All models trained on the same harmonized treebank data. Scores on held-out test sets unless noted. NER scores are on dev (no test set exists).
| Metric | LatinCy Stanza 0.3.1 | LatinCy Flair 0.3 | LatinCy UDPipe 0.2 | LatinCy spaCy trf 3.9 |
|---|---|---|---|---|
| UPOS | 97.26 | 98.02 | 94.07 | 97.34 |
| UFeats | 92.80 | -- | 80.82 | 93.95 |
| Lemma | 97.87 | 97.41 | 92.99 | 94.63 |
| UAS | 86.95 | -- | 76.48 | 86.91 |
| LAS | 83.23 | -- | 71.57 | 82.04 |
| NER F1 | 90.22 | 92.22 | -- | 91.14 |
Stanza leads on lemma, UAS, and LAS. Flair 0.3 (Latin BERT) leads on UPOS and NER, and spaCy trf leads on UFeats (Flair reports no UFeats score). spaCy trf is competitive across all metrics. UDPipe offers single-file portability usable from R, Python, CLI, and other platforms.
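The leader for each metric can be read off the comparison table programmatically. This is a minimal sketch with the scores copied from the table; `None` stands for "--" (not reported), and unreported frameworks are skipped for that metric.

```python
# Scores copied from the Cross-Framework Comparison table.
# None marks "--" (no score reported for that framework and metric).
table = {
    "UPOS": {"Stanza": 97.26, "Flair": 98.02, "UDPipe": 94.07, "spaCy trf": 97.34},
    "UFeats": {"Stanza": 92.80, "Flair": None, "UDPipe": 80.82, "spaCy trf": 93.95},
    "Lemma": {"Stanza": 97.87, "Flair": 97.41, "UDPipe": 92.99, "spaCy trf": 94.63},
    "UAS": {"Stanza": 86.95, "Flair": None, "UDPipe": 76.48, "spaCy trf": 86.91},
    "LAS": {"Stanza": 83.23, "Flair": None, "UDPipe": 71.57, "spaCy trf": 82.04},
    "NER F1": {"Stanza": 90.22, "Flair": 92.22, "UDPipe": None, "spaCy trf": 91.14},
}

leaders = {}
for metric, row in table.items():
    reported = {name: score for name, score in row.items() if score is not None}
    leaders[metric] = max(reported, key=reported.get)

for metric, name in leaders.items():
    print(f"{metric:8s} {name}")
```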
vs. Stanford's Official Latin Package (stanfordnlp/stanza-la)
Stanford distributes separate per-treebank models (ITTB, LLCT, Perseus, PROIEL, UDante) without character language models (nocharlm variants) and without NER. LatinCy Stanza trains a single unified model across all treebanks plus LASLA, with custom forward/backward CharLMs and pretrained word vectors. A direct benchmark comparison is planned for a future release.
Limitations
- No test split for NER: NER scores are on the dev set; no held-out test evaluation is available.
- Tokenizer scores on dev: No separate test evaluation was run for the tokenizer.
- LASLA data is morphology-only: Dependency parsing trained on UD data only (~1.03M tokens), not the full 2.87M token corpus.
- No transformer features: All components use BiLSTM + CharLM. A Latin BERT POS variant was trained for v0.3 but retracted in v0.3.1 due to a Stanza/Latin-BERT tokenizer loading incompatibility (see Evaluation Results).
- Large total size: The full model suite is ~1.1 GB due to 8 separate model files (including 2 CharLMs at 197 MB each). Individual components can be loaded selectively.
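To get a rough sense of what selective loading saves, the file sizes from the Pipeline Components table can be summed per processor list. This is a sketch under stated assumptions: the mapping from each processor to its supporting files (CharLMs, pretrained vectors) is inferred from the Training Procedure section, not from Stanza itself, and `footprint_mb` is a hypothetical helper.

```python
# File sizes in MB, copied from the Pipeline Components table.
SIZES_MB = {
    "tokenize": 11, "pos": 143, "lemma": 46, "depparse": 170, "ner": 151,
    "forward_charlm": 197, "backward_charlm": 197, "pretrain": 174,
}

# ASSUMPTION: supporting files per processor, inferred from the
# Training Procedure descriptions; Stanza may resolve these differently.
CHARLMS = {"forward_charlm", "backward_charlm"}
SUPPORT = {
    "tokenize": set(),
    "pos": CHARLMS | {"pretrain"},
    "lemma": set(CHARLMS),
    "depparse": CHARLMS | {"pretrain"},
    "ner": CHARLMS | {"pretrain"},
}


def footprint_mb(processors):
    """Sum the sizes of all files needed for a comma-separated processor list."""
    files = set()
    for name in processors.split(","):
        name = name.strip()
        files.add(name)
        files |= SUPPORT[name]  # shared files are counted once
    return sum(SIZES_MB[f] for f in files)


print(footprint_mb("tokenize,ner"))                     # 730
print(footprint_mb("tokenize,pos,lemma,depparse,ner"))  # 1089
```

Because the two CharLMs and the pretrained vectors are shared, most of the footprint is already incurred by the first tagger that needs them.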
Future Development
The following Stanza processors are not yet implemented for Latin in this release but will be considered for future development:
- Constituency parsing (phrase structure)
- Coreference resolution
- Sentiment analysis
- Multi-word token (MWT) expansion
Also, we expect to train the next version of LatinCy Stanza using a transformer model for improved accuracy on morphological features and dependency parsing.
Version History
| Version | Date | Treebank Data | Changes |
|---|---|---|---|
| 0.3.1 | 2026-04 | LatinCy v3.9 | Revert POS to the v0.2 CharLM checkpoint. The v0.3 Latin BERT POS model is incompatible with Stanza's BERT tokenizer loader (custom Latin BERT tokenizer requires trust_remote_code=True). All other components unchanged. |
| 0.3 | 2026-03 | LatinCy v3.9 | Latin BERT transformer backend for POS (UPOS +0.39, UFeats +1.13). Best-of per component: Latin BERT POS, CharLM for all others. Retracted in 0.3.1. |
| 0.2 | 2026-03 | LatinCy v3.9 | Retrained POS, lemma, depparse on harmonized treebanks with Gender feature fix. UFeats +0.60, UAS +0.22. |
| 0.1 | 2026-02 | LatinCy v3.8 | Initial release. All components (tokenizer, POS, lemma, depparse, NER, CharLM). |
References
- Qi, P., Zhang, Y., Zhang, Y., Bolton, J., and Manning, C. D. 2020. "Stanza: A Python Natural Language Processing Toolkit for Many Human Languages." In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. https://nlp.stanford.edu/pubs/qi2020stanza.pdf.
Citation
@misc{burns2026latincystanza,
author = {Burns, Patrick J.},
title = {{LatinCy Stanza (la\_stanza\_latincy)}},
year = {2026},
url = {https://huggingface.co/latincy/la_stanza_latincy},
}
Acknowledgments
This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.