---
tags:
- spacy
- token-classification
- named-entity-linking
- wikidata
- serbian
language:
- sr
license: cc-by-sa-4.0
library_name: spacy
pipeline_tag: token-classification
model-index:
- name: sr_nel_all
  results:
  - task:
      name: NER
      type: token-classification
    metrics:
    - name: NER Precision
      type: precision
      value: 0.9334730579
    - name: NER Recall
      type: recall
      value: 0.9340414017
    - name: NER F Score
      type: f_score
      value: 0.9337571433
  - task:
      name: TAG
      type: token-classification
    metrics:
    - name: TAG (XPOS) Accuracy
      type: accuracy
      value: 0.9648654551
  - task:
      name: Named Entity Linking
      type: token-classification
    dataset:
      name: sr-geography
      type: sr-geography
    metrics:
    - name: srNEL-all Precision
      type: precision
      value: 0.986
    - name: srNEL-all Recall
      type: recall
      value: 0.740
    - name: srNEL-all F1
      type: f1
      value: 0.845
---

# srNEL-all: Serbian Named Entity Linking with spaCy

`sr_nel_all` is a spaCy pipeline for Serbian named entity recognition and named entity linking. It detects named entities in Serbian text and links recognized mentions to Wikidata identifiers.

The model corresponds to the `srNEL-all` configuration from the accepted paper **CNN-based Named Entity Linking: Serbian Use Case**. It is a CNN-based spaCy model that builds on the SrpCNNER2 NER model and trains the entity linker on all available entity types.

## Intended Use

This model is intended for Serbian NLP workflows that need named entities connected to Wikidata QIDs, especially geolocational entity linking in Serbian educational, geographical, literary, news, and related text.

Recommended uses:

- Linking Serbian location mentions to Wikidata.
- Enriching Serbian texts with structured entity identifiers.
- Building downstream information retrieval, corpus analysis, digital humanities, and knowledge base enrichment workflows.
- Research comparisons for Serbian NER and NEL.

The model is strongest on geolocational entity linking. Broader cross-domain use should be validated on the target corpus before production use.

## Installation and Usage

Install the wheel from this repository, or download the model files and load the local spaCy package.

```bash
pip install sr_nel_all-any-py3-none-any.whl
```

```python
import spacy

nlp = spacy.load("sr_nel_all")
doc = nlp("Poljska se graniči sa sedam zemalja, uključujući Nemačku i Ukrajinu.")

for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)
```

The output contains detected entity spans, their NER labels, and the Wikidata knowledge base identifier assigned by the entity linker.
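
As a minimal sketch of how the `ent.kb_id_` values might be consumed downstream: spaCy's entity linker writes the string `NIL` when it cannot select a knowledge base entry, and `kb_id_` is an empty string when no identifier was set at all, so both cases are worth filtering before building links. The helper name `wikidata_url` and the sample identifiers are illustrative assumptions, not part of the package.

```python
from typing import Optional

WIKIDATA_ENTITY_URL = "https://www.wikidata.org/wiki/{qid}"


def wikidata_url(kb_id: str) -> Optional[str]:
    """Return a Wikidata page URL for a QID, or None for unlinked mentions."""
    # "" means no identifier was set; "NIL" is spaCy's marker for "no KB match".
    if not kb_id or kb_id == "NIL":
        return None
    return WIKIDATA_ENTITY_URL.format(qid=kb_id)


# Sample strings standing in for `ent.kb_id_` values (assumed for illustration).
for kb_id in ["Q36", "NIL", ""]:
    print(repr(kb_id), "->", wikidata_url(kb_id))
```

With the pipeline loaded, the same helper can be applied as `wikidata_url(ent.kb_id_)` for each `ent` in `doc.ents`.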

## Pipeline

| Feature | Value |
| --- | --- |
| Model name | `sr_nel_all` |
| Version | `1.0.0` |
| Language | Serbian (`sr`) |
| Framework | spaCy |
| spaCy version | `>=3.5.2,<3.6.0` |
| Architecture | CNN-based spaCy pipeline with entity linker |
| Pipeline | `tok2vec`, `tagger`, `ner`, `sentencizer`, `entity_linker` |
| Vectors | 0 keys, 0 unique vectors, 0 dimensions |
| License | CC BY-SA 4.0 |
| Authors | Milica Ikonić Nešić, Saša Petalinkar, Ranka Stanković, Miloš Utvić, Olivera Kitanović |
| Project page | [TESLA](https://tesla.rgf.bg.ac.rs/) |
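
Because the package is pinned to a narrow spaCy range, it can help to check the installed version before loading. The sketch below encodes the `>=3.5.2,<3.6.0` constraint from the table; the function names are hypothetical, and the parser assumes plain numeric release versions such as `3.5.4`.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '3.5.4' into (3, 5, 4); only plain numeric versions are handled."""
    return tuple(int(part) for part in version.split(".")[:3])


def is_supported_spacy(version: str) -> bool:
    """Check the range listed in the pipeline table: >=3.5.2,<3.6.0."""
    return (3, 5, 2) <= parse_version(version) < (3, 6, 0)


# Example version strings (assumed for illustration).
for candidate in ["3.5.2", "3.5.4", "3.6.0", "3.4.4"]:
    print(candidate, is_supported_spacy(candidate))
```

In a real environment the check would be `is_supported_spacy(spacy.__version__)` after `import spacy`.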

## Labels

The NER component recognizes seven named entity categories:

| Label | Description |
| --- | --- |
| `DEMO` | Demonyms |
| `EVENT` | Events |
| `LOC` | Locations |
| `ORG` | Organizations |
| `PERS` | Persons |
| `ROLE` | Professions, titles, and roles |
| `WORK` | Works of art |

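The tagger writes its predictions to spaCy's fine-grained tag attribute (`token.tag_`, reported as XPOS in the metrics). The labels themselves follow the Universal POS tagset: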

`ADJ`, `ADP`, `ADV`, `AUX`, `CCONJ`, `DET`, `INTJ`, `NOUN`, `NUM`, `PART`, `PRON`, `PROPN`, `PUNCT`, `SCONJ`, `VERB`, `X`.

## Training Data

The `srNEL-all` model was trained on a Serbian corpus of **73,493 sentences** containing manually checked named entity and entity linking annotations.

The training corpus combines:

- Serbian novels.
- Newspaper articles.
- Legal documents.
- Wikipedia and sr-ELEXIS material.
- Synthetic sentences generated from Wikidata and the Leximirka lexical database.

Entity distribution in the expanded dataset:

| Entity type | Mentions |
| --- | ---: |
| `LOC` | 36,655 |
| `ORG` | 11,061 |
| `PERS` | 13,636 |

For locations, **35,712 LOC mentions** were linked to Wikidata QIDs, while **943 LOC mentions** were assigned NIL links because no suitable Wikidata item was available at annotation time.

The linker was trained on seven entity types: `PERS`, `LOC`, `ORG`, `ROLE`, `WORK`, `DEMO`, and `EVENT`. The train/test split used for the NEL training setup was **58,618 training sentences** and **14,875 test sentences**.
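
The counts in this section are internally consistent, which a few lines of arithmetic can confirm. The sketch uses only the figures quoted above; the variable names are illustrative. It shows that the split is roughly 80/20 and that about 2.6% of `LOC` mentions carry NIL links.

```python
# Counts quoted in the Training Data section.
train_sentences = 58_618
test_sentences = 14_875
loc_linked = 35_712
loc_nil = 943

total_sentences = train_sentences + test_sentences
total_loc = loc_linked + loc_nil

train_share = train_sentences / total_sentences
nil_share = loc_nil / total_loc

print(f"total sentences: {total_sentences:,}")
print(f"train share:     {train_share:.1%}")
print(f"LOC mentions:    {total_loc:,}")
print(f"LOC NIL share:   {nil_share:.1%}")
```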

## Knowledge Base

The entity linker uses a curated Wikidata-aligned Serbian knowledge base.

For `srNEL-all`, the KB contains **3,008 entities**. Entities are represented with Wikidata QIDs, aliases, and Serbian Wikipedia descriptions where available. The KB also includes inflectional forms as aliases, which is important for Serbian because named entities frequently appear in declined forms.
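
To illustrate why inflectional aliases matter, here is a minimal sketch of an alias lookup in plain Python. It is not the packaged KB or spaCy's KB API: the two entries, the dictionary layout, and the `lookup` helper are assumptions for illustration, and the sketch assumes each alias maps to a single QID, whereas a real linker may need to choose among several candidates using context.

```python
from typing import Dict, List, Optional

# Illustrative entries only; the packaged KB is not reproduced here.
ALIASES_BY_QID: Dict[str, List[str]] = {
    "Q183": ["Nemačka", "Nemačke", "Nemačkoj", "Nemačku", "Nemačkom"],
    "Q212": ["Ukrajina", "Ukrajine", "Ukrajini", "Ukrajinu", "Ukrajinom"],
}

# Invert the mapping so every declined form points at its QID.
ALIAS_TO_QID: Dict[str, str] = {
    alias.casefold(): qid
    for qid, aliases in ALIASES_BY_QID.items()
    for alias in aliases
}


def lookup(mention: str) -> Optional[str]:
    """Return the QID for a surface form, or None if the form is unknown."""
    return ALIAS_TO_QID.get(mention.casefold())


print(lookup("Nemačku"))   # accusative form resolves to the same QID
print(lookup("Beograd"))   # not in this toy table
```

Without the declined forms as aliases, only the nominative `Nemačka` would match, and the accusative `Nemačku` in running text would go unlinked.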

The KB covers categories including cities, countries, rivers, mountains, seas, oceans, islands, peninsulas, continents, administrative units, localities, organizations, persons, geographic regions, planets, and other entity classes used in the model.

## Evaluation

The main external evaluation described in the associated paper uses the **sr-geography** corpus, a Serbian geography textbook corpus for elementary school students.

The sr-geography evaluation set contains:

- 710 sentences.
- 2,297 words.
- 746 annotated geolocational entities.
- 212 unique Wikidata QIDs.

Evaluation used a strict criterion: a prediction is counted as correct only when both the entity span and the Wikidata QID match the gold annotation.
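
The strict criterion can be sketched as exact matching on `(start, end, QID)` triples. This is an illustration under stated assumptions, not the evaluation script from the paper: the function name, the character-offset representation, and the toy gold and predicted sets are hypothetical, and the handling of NIL links is not specified in this card.

```python
from typing import Dict, Set, Tuple

Link = Tuple[int, int, str]  # (start offset, end offset, QID)


def strict_prf(gold: Set[Link], predicted: Set[Link]) -> Dict[str, float]:
    """Score predictions; a hit requires identical span boundaries and QID."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data: the third prediction has the right QID but a span that is one
# character short, so it counts as both a false positive and a false negative.
gold = {(0, 7, "Q36"), (49, 56, "Q183"), (59, 67, "Q212")}
predicted = {(0, 7, "Q36"), (49, 56, "Q183"), (59, 66, "Q212")}

print(strict_prf(gold, predicted))
```

Under this criterion a partially overlapping span earns no credit, which is one reason multi-word toponyms can lower recall even when the linker picks the right Wikidata item.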

### sr-geography NEL Results

| Model | Precision | Recall | F1 |
| --- | ---: | ---: | ---: |
| `srNEL-all` | 0.986 | 0.740 | 0.845 |
| `SrpCNNeL` baseline | n/a | n/a | 0.731 |

The `srNEL-all` configuration achieved the strongest CNN-based result in the reported comparison, outperforming the earlier `SrpCNNeL` baseline on geolocational entity linking.

### Internal spaCy Package Metrics

| Metric | Score |
| --- | ---: |
| XPOS accuracy | 0.9649 |
| NER precision | 0.9335 |
| NER recall | 0.9340 |
| NER F1 | 0.9338 |

NER performance by entity type:

| Entity type | Precision | Recall | F1 |
| --- | ---: | ---: | ---: |
| `ROLE` | 0.8352 | 0.8221 | 0.8286 |
| `PERS` | 0.9713 | 0.9787 | 0.9750 |
| `LOC` | 0.9330 | 0.9697 | 0.9510 |
| `DEMO` | 0.8740 | 0.8520 | 0.8628 |
| `ORG` | 0.7676 | 0.6544 | 0.7065 |
| `WORK` | 0.6563 | 0.2958 | 0.4078 |
| `EVENT` | 0.5556 | 0.3125 | 0.4000 |
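
The overall NER F1 of 0.9338 is computed over all entity mentions, so frequent types such as `PERS` and `LOC` dominate it. As a rough sketch, the per-type rows can be checked against the harmonic-mean definition of F1 and averaged without weighting. The unweighted (macro) average below is derived here from the table and is not a figure reported for this model.

```python
# (precision, recall, F1) per entity type, copied from the table above.
per_type = {
    "ROLE": (0.8352, 0.8221, 0.8286),
    "PERS": (0.9713, 0.9787, 0.9750),
    "LOC": (0.9330, 0.9697, 0.9510),
    "DEMO": (0.8740, 0.8520, 0.8628),
    "ORG": (0.7676, 0.6544, 0.7065),
    "WORK": (0.6563, 0.2958, 0.4078),
    "EVENT": (0.5556, 0.3125, 0.4000),
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Largest gap between a reported F1 and the value recomputed from P and R.
max_gap = max(abs(f1_from(p, r) - f1) for p, r, f1 in per_type.values())

# Unweighted mean over the seven types (derived, not reported).
macro_f1 = sum(f1 for _, _, f1 in per_type.values()) / len(per_type)

print(f"largest rounding gap: {max_gap:.5f}")
print(f"macro-averaged F1:    {macro_f1:.4f}")
```

The gap between the overall score and the unweighted average reflects the weak recall on `WORK` and `EVENT`, which is worth keeping in mind for corpora where those types are common.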

## Limitations

- The model is strongest for Serbian geolocational entity linking and should be evaluated before use in other domains.
- The external evaluation corpus is focused on geography textbook text, so reported NEL results may not generalize directly to news, literary, legal, or web text.
- Multi-word entities are a known source of errors, especially Serbian toponyms with inflection or complex names.
- Rare categories such as islands, oceans, planets, and geographic regions require more evaluation data.
- The system depends on Wikidata coverage. Mentions without a suitable Wikidata item may receive NIL links or remain unresolved.
- CNN-based pipelines are efficient, but transformer-based models may offer stronger accuracy for some Serbian NER/NEL scenarios.

## Citation

The paper describing this model is in press:

```bibtex
@article{IkonicNesic2026CNN,
  author    = {Ikoni{\'c} Ne{\v{s}}i{\'c}, M. and Petalinkar, S. and Kitanovi{\'c}, O. and Stankovi{\'c}, R. and Utvi{\'c}, M.},
  title     = {CNN-based Named Entity Linking: Serbian Use Case},
  journal   = {Poznan Studies in Contemporary Linguistics},
  year      = {2026},
  note      = {In press},
}
```

## Acknowledgments

This research was supported by the Science Fund of the Republic of Serbia, project **Text Embeddings - Serbian Language Applications - TESLA**. The work also acknowledges the use of Serbian linguistic resources and corpora described in the associated paper, including Leximirka, Wikidata-derived data, sr-ELEXIS, SrpELTeC-related resources, and the sr-geography evaluation material.