---
language:
- da
license: mit
tags:
- named-entity-recognition
- token-classification
- danish
- ner
- xlm-roberta
- pii-detection
- gdpr
datasets:
- chcaa/dansk-ner
- KennethEnevoldsen/dane_plus
- ltg/norne
pipeline_tag: token-classification
model-index:
- name: danish-xlmr-ner-large
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      type: chcaa/dansk-ner
      name: DANSK (dev)
    metrics:
    - type: f1
      value: 87.6
      name: F1 (micro)
    - type: precision
      value: 86.6
      name: Precision
    - type: recall
      value: 88.7
      name: Recall
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      type: chcaa/dansk-ner
      name: DANSK (test)
    metrics:
    - type: f1
      value: 86.5
      name: F1 (micro)
    - type: precision
      value: 85.4
      name: Precision
    - type: recall
      value: 87.5
      name: Recall
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      type: KennethEnevoldsen/dane_plus
      name: DaNE (dev)
    metrics:
    - type: f1
      value: 93.0
      name: F1 (micro)
    - type: precision
      value: 93.2
      name: Precision
    - type: recall
      value: 92.9
      name: Recall
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      type: KennethEnevoldsen/dane_plus
      name: DaNE (test)
    metrics:
    - type: f1
      value: 87.7
      name: F1 (micro)
    - type: precision
      value: 88.1
      name: Precision
    - type: recall
      value: 87.3
      name: Recall
---

# Danish XLM-R NER Large (Two-Stage)

A Danish named entity recognition model based on `xlm-roberta-large` (560M parameters), fine-tuned in two stages for high-recall PII detection in Danish text.

## Model Description

This model detects three entity types relevant for GDPR-compliant PII processing:

- **PER** - Person names
- **ORG** - Organizations (companies, institutions, government bodies)
- **LOC** - Locations (addresses, cities, countries)

MISC is intentionally excluded to reduce noise and focus on actionable PII entities.

### Training

**Two-stage fine-tuning approach:**

1. **Stage 1** (broad NER): DANSK + DaNE + NorNE, 10 epochs, learning rate 2e-5
2. **Stage 2** (domain adaptation): DANSK only, 1 epoch, learning rate 5e-6

This schedule balances multi-domain generalization against Danish-specific performance.
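The two-stage schedule can be sketched as plain configuration. The stage names and the `train_stage` callback below are illustrative; only the corpora, epoch counts, and learning rates come from this card:

```python
# Hypothetical sketch of the two-stage schedule; dataset loading and
# the actual Trainer wiring are omitted.
STAGES = [
    # Stage 1: broad NER on the combined corpora
    {"name": "stage1-broad", "datasets": ["DANSK", "DaNE", "NorNE"],
     "epochs": 10, "learning_rate": 2e-5},
    # Stage 2: low-LR domain adaptation on DANSK only
    {"name": "stage2-adapt", "datasets": ["DANSK"],
     "epochs": 1, "learning_rate": 5e-6},
]

def run_schedule(train_stage, checkpoint="xlm-roberta-large"):
    """Run each stage in order, feeding the previous stage's
    weights into the next one."""
    for stage in STAGES:
        checkpoint = train_stage(checkpoint, stage)
    return checkpoint
```

The key point is that Stage 2 resumes from the Stage 1 checkpoint with a much smaller learning rate, nudging the model toward DANSK without erasing what it learned from the broader mix.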

### Datasets

| Dataset | Role | Size | Domains |
|---------|------|------|---------|
| [DANSK](https://huggingface.co/datasets/chcaa/dansk-ner) | Primary | 11.7K train | Web, News, Wiki, Legal, DanNet, Conversation, Social Media |
| [DaNE](https://huggingface.co/datasets/KennethEnevoldsen/dane_plus) | Supplementary | 4.4K train | News |
| [NorNE](https://huggingface.co/datasets/ltg/norne) | Stage 1 only | ~20K train | News (Norwegian Bokmål + Nynorsk) |

## Evaluation Results

### DANSK (primary benchmark, multi-domain)

| Split | PER F1 | ORG F1 | LOC F1 | Micro F1 | Precision | Recall |
|-------|--------|--------|--------|----------|-----------|--------|
| Dev | 88.0 | 85.3 | 90.3 | **87.6** | 86.6 | 88.7 |
| Test | 84.8 | 84.6 | 90.3 | **86.5** | 85.4 | 87.5 |

### DaNE (secondary benchmark, news domain)

| Split | PER F1 | ORG F1 | LOC F1 | Micro F1 | Precision | Recall |
|-------|--------|--------|--------|----------|-----------|--------|
| Dev | 97.5 | 85.1 | 92.9 | **93.0** | 93.2 | 92.9 |
| Test | 94.2 | 79.7 | 87.8 | **87.7** | 88.1 | 87.3 |

### GPI Legal Documents (independent evaluation, Danish legal domain)

Evaluated on 30 human-corrected documents (contracts, invoices, case briefs, client letters):

| Entity | Precision | Recall | Notes |
|--------|-----------|--------|-------|
| PER | 0.76 | 1.00 | Perfect recall; false positives are email addresses misclassified as PER |
| ORG | 0.94 | 0.96 | Near-perfect |
| LOC | 0.52 | 0.51 | Boundary errors (detects the street, misses the house number); detection itself is near-perfect |

The low LOC score reflects strict span matching. The model consistently detects location entities but predicts shorter spans than the gold annotation (e.g. "Gothersgade" instead of "Gothersgade 81"). A post-processing step that extends LOC spans to include adjacent house numbers resolves this.
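A minimal sketch of such a post-processing step, assuming entities in the `transformers` pipeline output format with character-level `start`/`end` offsets; the function name and the exact house-number regex are illustrative:

```python
import re

def extend_loc_spans(text, entities):
    """Extend LOC spans to absorb an immediately following house
    number, e.g. 'Gothersgade' -> 'Gothersgade 81'. Illustrative
    sketch; expects pipeline-style dicts with start/end offsets."""
    extended = []
    for ent in entities:
        ent = dict(ent)  # avoid mutating the caller's dicts
        if ent.get("entity_group") == "LOC":
            # Match a trailing ' 81' or ' 81B' style suffix after the span
            m = re.match(r"\s+\d+[A-Za-z]?", text[ent["end"]:])
            if m:
                ent["end"] += m.end()
                ent["word"] = text[ent["start"]:ent["end"]]
        extended.append(ent)
    return extended
```

More elaborate variants could also absorb floor/door suffixes ("2. tv.") or postal codes, at the cost of more false extensions.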

## Usage

```python
from transformers import pipeline

model_name = "thomasbeste/danish-xlmr-ner-large"
nlp = pipeline("ner", model=model_name, aggregation_strategy="simple")

text = "Anders Jensen fra Danske Bank bor på Vestergade 42 i København."
entities = nlp(text)

for ent in entities:
    print(f"  {ent['entity_group']}: {ent['word']} (score: {ent['score']:.3f})")
```

### ONNX Deployment

For production CPU inference, export to ONNX and quantize to INT8 for roughly a 3x speedup:

```bash
pip install "optimum[onnxruntime]"

# Export to ONNX
optimum-cli export onnx --model thomasbeste/danish-xlmr-ner-large ./model-onnx --task token-classification

# Quantize to INT8 (dynamic quantization)
python -c "
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
q = ORTQuantizer.from_pretrained('./model-onnx')
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
q.quantize(save_dir='./model-onnx-int8', quantization_config=qconfig)
"
```

## Label Scheme

IOB2 format with 7 labels:

| ID | Label |
|----|-------|
| 0 | O |
| 1 | B-PER |
| 2 | I-PER |
| 3 | B-ORG |
| 4 | I-ORG |
| 5 | B-LOC |
| 6 | I-LOC |
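The same mapping as code, e.g. for decoding raw logits outside the pipeline API (the constant names are illustrative):

```python
# IOB2 label mapping from the table above (id2label / label2id)
ID2LABEL = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
}
LABEL2ID = {label: i for i, label in ID2LABEL.items()}
```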

## Intended Use

Designed for GDPR-compliant PII detection in Danish enterprise document-processing pipelines. Optimized for **recall over precision**: a missed entity (false negative) is a compliance risk, while an over-detection (false positive) is comparatively safe.

## Limitations

- Optimized for Danish text. It may work on other Scandinavian languages (Norwegian, Swedish) but has not been evaluated on them.
- LOC boundary detection tends to predict shorter spans than the full address; post-processing is recommended.
- Email addresses are sometimes misclassified as PER. Downstream validation (e.g. rejecting names containing `@`) is recommended.
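A minimal sketch of that downstream validation, assuming pipeline-style entity dicts (the function name is illustrative):

```python
def drop_email_per(entities):
    """Remove PER predictions that look like email addresses, per
    the limitation above. Expects pipeline-style entity dicts."""
    return [
        ent for ent in entities
        if not (ent.get("entity_group") == "PER" and "@" in ent.get("word", ""))
    ]
```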
|