---
tags:
- spacy
- token-classification
- named-entity-linking
- wikidata
- serbian
language:
- sr
license: cc-by-sa-4.0
library_name: spacy
pipeline_tag: token-classification
model-index:
- name: sr_nel_all
  results:
  - task:
      name: NER
      type: token-classification
    metrics:
    - name: NER Precision
      type: precision
      value: 0.9334730579
    - name: NER Recall
      type: recall
      value: 0.9340414017
    - name: NER F Score
      type: f_score
      value: 0.9337571433
  - task:
      name: TAG
      type: token-classification
    metrics:
    - name: TAG (XPOS) Accuracy
      type: accuracy
      value: 0.9648654551
  - task:
      name: Named Entity Linking
      type: token-classification
    dataset:
      name: sr-geography
      type: sr-geography
    metrics:
    - name: srNEL-all Precision
      type: precision
      value: 0.986
    - name: srNEL-all Recall
      type: recall
      value: 0.740
    - name: srNEL-all F1
      type: f1
      value: 0.845
---

# srNEL-all: Serbian Named Entity Linking with spaCy

`sr_nel_all` is a spaCy pipeline for Serbian named entity recognition and named entity linking. It detects named entities in Serbian text and links recognized mentions to Wikidata identifiers.

The model corresponds to the `srNEL-all` configuration from the accepted paper **CNN-based Named Entity Linking: Serbian Use Case**. It is a CNN-based spaCy model that uses the SrpCNNER2 NER base and trains the entity linker on all available entity types.

## Intended Use

This model is intended for Serbian NLP workflows that need named entities connected to Wikidata QIDs, especially geolocational entity linking in Serbian educational, geographical, literary, news, and related text.

Recommended uses:

- Linking Serbian location mentions to Wikidata.
- Enriching Serbian texts with structured entity identifiers.
- Building downstream information retrieval, corpus analysis, digital humanities, and knowledge base enrichment workflows.
- Research comparisons for Serbian NER and NEL.

The model is strongest on geolocational entity linking.
Broader cross-domain use should be validated on the target corpus before production use.

## Installation and Usage

Install the wheel from this repository, or download the model files and load the local spaCy package.

```bash
pip install sr_nel_all-any-py3-none-any.whl
```

```python
import spacy

# Load the installed pipeline package
nlp = spacy.load("sr_nel_all")

doc = nlp("Poljska se graniči sa sedam zemalja, uključujući Nemačku i Ukrajinu.")

# Print each entity's text, NER label, and linked Wikidata QID
for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)
```

The output contains detected entity spans, their NER labels, and the Wikidata knowledge base identifier assigned by the entity linker.

## Pipeline

| Feature | Value |
| --- | --- |
| Model name | `sr_NEL_all` |
| Version | `1.0.0` |
| Language | Serbian (`sr`) |
| Framework | spaCy |
| spaCy version | `>=3.5.2,<3.6.0` |
| Architecture | CNN-based spaCy pipeline with entity linker |
| Pipeline | `tok2vec`, `tagger`, `ner`, `sentencizer`, `entity_linker` |
| Vectors | 0 keys, 0 unique vectors, 0 dimensions |
| License | CC BY-SA 4.0 |
| Authors | Milica Ikonić Nešić, Saša Petalinkar, Ranka Stanković, Miloš Utvić, Olivera Kitanović |
| Project page | [TESLA](https://tesla.rgf.bg.ac.rs/) |

## Labels

The NER component recognizes seven named entity categories:

| Label | Description |
| --- | --- |
| `DEMO` | Demonyms |
| `EVENT` | Events |
| `LOC` | Locations |
| `ORG` | Organizations |
| `PERS` | Persons |
| `ROLE` | Professions, titles, and roles |
| `WORK` | Works of art |

The POS tagger uses the following XPOS-style labels:

`ADJ`, `ADP`, `ADV`, `AUX`, `CCONJ`, `DET`, `INTJ`, `NOUN`, `NUM`, `PART`, `PRON`, `PROPN`, `PUNCT`, `SCONJ`, `VERB`, `X`.

## Training Data

The `srNEL-all` model was trained on a Serbian corpus of **73,493 sentences** containing manually checked named entity and entity linking annotations.

The training corpus combines:

- Serbian novels.
- Newspaper articles.
- Legal documents.
- Wikipedia and sr-ELEXIS material.
- Synthetic sentences generated from Wikidata and the Leximirka lexical database.

Entity distribution in the expanded dataset:

| Entity type | Mentions |
| --- | ---: |
| `LOC` | 36,655 |
| `ORG` | 11,061 |
| `PERS` | 13,636 |

For locations, **35,712 LOC mentions** were linked to Wikidata QIDs, while **943 LOC mentions** were assigned NIL links because no suitable Wikidata item was available at annotation time.

The linker was trained on seven entity types: `PERS`, `LOC`, `ORG`, `ROLE`, `WORK`, `DEMO`, and `EVENT`. The train/test split used for the NEL training setup was **58,618 training sentences** and **14,875 test sentences**.

## Knowledge Base

The entity linker uses a curated Wikidata-aligned Serbian knowledge base. For `srNEL-all`, the KB contains **3,008 entities**.

Entities are represented with Wikidata QIDs, aliases, and Serbian Wikipedia descriptions where available. The KB also includes inflectional forms as aliases, which is important for Serbian because named entities frequently appear in declined forms.

The KB covers categories including cities, countries, rivers, mountains, seas, oceans, islands, peninsulas, continents, administrative units, localities, organizations, persons, geographic regions, planets, and other entity classes used in the model.

## Evaluation

The main external evaluation described in the associated paper uses the **sr-geography** corpus, a Serbian geography textbook corpus for elementary school students.

The sr-geography evaluation set contains:

- 710 sentences.
- 2,297 words.
- 746 annotated geolocational entities.
- 212 unique Wikidata QIDs.

Evaluation used a strict criterion: a prediction is counted as correct only when both the entity span and the Wikidata QID match the gold annotation.
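
The strict criterion can be illustrated with a minimal sketch. This is not the evaluation script from the paper: the `Link` tuple, the `strict_prf` and `f1_score` helpers, and the toy gold and predicted annotations are all hypothetical and written by hand for illustration. The sketch also assumes that a prediction with the right span but the wrong QID counts as both a false positive and a missed gold link, which the paper may handle differently.

```python
from typing import NamedTuple, Set, Tuple


class Link(NamedTuple):
    """One linked mention (hypothetical structure, not part of the model API)."""
    doc_id: int  # index of the sentence or document
    start: int   # character offset where the mention starts
    end: int     # character offset where the mention ends (exclusive)
    qid: str     # Wikidata QID assigned to the mention


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def strict_prf(gold: Set[Link], pred: Set[Link]) -> Tuple[float, float, float]:
    """Score predictions so that a hit needs the same span AND the same QID."""
    true_positives = len(gold & pred)  # exact tuple equality enforces strictness
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Hand-made toy annotations (NOT model output) for the sentence
# "Poljska se graniči sa sedam zemalja, uključujući Nemačku i Ukrajinu."
gold = {
    Link(0, 0, 7, "Q36"),     # Poljska  -> Poland
    Link(0, 49, 56, "Q183"),  # Nemačku  -> Germany
    Link(0, 59, 67, "Q212"),  # Ukrajinu -> Ukraine
}
pred = {
    Link(0, 0, 7, "Q36"),     # correct span, correct QID
    Link(0, 49, 56, "Q183"),  # correct span, correct QID
    Link(0, 59, 67, "Q159"),  # correct span, wrong QID: not counted as a hit
}

p, r, f = strict_prf(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")

# Consistency check on the reported sr-geography scores
print(round(f1_score(0.986, 0.740), 3))
```

In the toy data, the third prediction has the right span but the wrong QID, so only two of three predictions count as correct. The final line shows that the reported precision and recall are consistent with the reported F1 of 0.845.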

### sr-geography NEL Results

| Model | Precision | Recall | F1 |
| --- | ---: | ---: | ---: |
| `srNEL-all` | 0.986 | 0.740 | 0.845 |
| `SrpCNNeL` baseline | n/a | n/a | 0.731 |

The `srNEL-all` configuration achieved the strongest CNN-based result in the reported comparison, outperforming the earlier `SrpCNNeL` baseline on geolocational entity linking.

### Internal spaCy Package Metrics

| Metric | Score |
| --- | ---: |
| XPOS accuracy | 0.9649 |
| NER precision | 0.9335 |
| NER recall | 0.9340 |
| NER F1 | 0.9338 |

NER performance by entity type:

| Entity type | Precision | Recall | F1 |
| --- | ---: | ---: | ---: |
| `ROLE` | 0.8352 | 0.8221 | 0.8286 |
| `PERS` | 0.9713 | 0.9787 | 0.9750 |
| `LOC` | 0.9330 | 0.9697 | 0.9510 |
| `DEMO` | 0.8740 | 0.8520 | 0.8628 |
| `ORG` | 0.7676 | 0.6544 | 0.7065 |
| `WORK` | 0.6563 | 0.2958 | 0.4078 |
| `EVENT` | 0.5556 | 0.3125 | 0.4000 |

## Limitations

- The model is strongest for Serbian geolocational entity linking and should be evaluated before use in other domains.
- The external evaluation corpus is focused on geography textbook text, so reported NEL results may not generalize directly to news, literary, legal, or web text.
- Multi-word entities are a known source of errors, especially Serbian toponyms with inflection or complex names.
- Rare categories such as islands, oceans, planets, and geographic regions require more evaluation data.
- The system depends on Wikidata coverage. Mentions without a suitable Wikidata item may receive NIL links or remain unresolved.
- CNN-based pipelines are efficient, but transformer-based models may offer stronger accuracy for some Serbian NER/NEL scenarios.

## Citation

The paper describing this model is in press:

```bibtex
@article{IkonicNesic2026CNN,
  author  = {Ikoni{\'c} Ne{\v{s}}i{\'c}, M. and Petalinkar, S. and Kitanovi{\'c}, O. and Stankovi{\'c}, R. and Utvi{\'c}, M.},
  title   = {CNN-based Named Entity Linking: Serbian Use Case},
  journal = {Poznan Studies in Contemporary Linguistics},
  year    = {2026},
  note    = {In press},
}
```

## Acknowledgments

This research was supported by the Science Fund of the Republic of Serbia, project **Text Embeddings - Serbian Language Applications - TESLA**.

The work also acknowledges the use of Serbian linguistic resources and corpora described in the associated paper, including Leximirka, Wikidata-derived data, sr-ELEXIS, SrpELTeC-related resources, and the sr-geography evaluation material.