---
language:
- da
license: mit
tags:
- named-entity-recognition
- token-classification
- danish
- ner
- xlm-roberta
- pii-detection
- gdpr
datasets:
- chcaa/dansk-ner
- KennethEnevoldsen/dane_plus
- ltg/norne
pipeline_tag: token-classification
model-index:
- name: danish-xlmr-ner-large
results:
- task:
type: token-classification
name: Named Entity Recognition
dataset:
type: chcaa/dansk-ner
name: DANSK (dev)
metrics:
- type: f1
value: 87.6
name: F1 (micro)
- type: precision
value: 86.6
name: Precision
- type: recall
value: 88.7
name: Recall
- task:
type: token-classification
name: Named Entity Recognition
dataset:
type: chcaa/dansk-ner
name: DANSK (test)
metrics:
- type: f1
value: 86.5
name: F1 (micro)
- type: precision
value: 85.4
name: Precision
- type: recall
value: 87.5
name: Recall
- task:
type: token-classification
name: Named Entity Recognition
dataset:
type: KennethEnevoldsen/dane_plus
name: DaNE (dev)
metrics:
- type: f1
value: 93.0
name: F1 (micro)
- type: precision
value: 93.2
name: Precision
- type: recall
value: 92.9
name: Recall
- task:
type: token-classification
name: Named Entity Recognition
dataset:
type: KennethEnevoldsen/dane_plus
name: DaNE (test)
metrics:
- type: f1
value: 87.7
name: F1 (micro)
- type: precision
value: 88.1
name: Precision
- type: recall
value: 87.3
name: Recall
---
# Danish XLM-R NER Large (Two-Stage)
A Danish Named Entity Recognition model based on `xlm-roberta-large` (560M parameters), fine-tuned in two stages for high-recall PII detection in Danish text.
## Model Description
This model detects three entity types relevant for GDPR-compliant PII processing:
- **PER** - Person names
- **ORG** - Organizations (companies, institutions, government bodies)
- **LOC** - Locations (addresses, cities, countries)
MISC is intentionally excluded to reduce noise and focus on actionable PII entities.
### Training
**Two-stage fine-tuning approach:**
1. **Stage 1** (broad NER): DANSK + DaNE + NorNE, 10 epochs, LR 2e-5
2. **Stage 2** (domain adaptation): DANSK-only, 1 epoch, LR 5e-6
This approach achieves the best balance between multi-domain generalization and Danish-specific performance.
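For quick reference, the two stages can be written down as plain configuration dicts. This sketch contains only the hyperparameters stated above; batch size, scheduler, and other training settings are not specified in this card and are omitted:

```python
# Hyperparameters for the two fine-tuning stages described above.
# Only the datasets, epochs, and learning rates come from this card;
# all other settings (batch size, warmup, etc.) are unspecified.
STAGES = [
    {
        "name": "stage1_broad_ner",
        "datasets": ["DANSK", "DaNE", "NorNE"],
        "num_train_epochs": 10,
        "learning_rate": 2e-5,
    },
    {
        "name": "stage2_domain_adaptation",
        "datasets": ["DANSK"],
        "num_train_epochs": 1,
        "learning_rate": 5e-6,
    },
]

for stage in STAGES:
    print(f"{stage['name']}: {stage['num_train_epochs']} epoch(s) "
          f"at LR {stage['learning_rate']} on {' + '.join(stage['datasets'])}")
```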
### Datasets
| Dataset | Role | Size | Domains |
|---------|------|------|---------|
| [DANSK](https://huggingface.co/datasets/chcaa/dansk-ner) | Primary | 11.7K train | Web, News, Wiki, Legal, Dannet, Conversation, Social Media |
| [DaNE](https://huggingface.co/datasets/KennethEnevoldsen/dane_plus) | Supplementary | 4.4K train | News |
| [NorNE](https://huggingface.co/datasets/ltg/norne) | Stage 1 only | ~20K train | News (Norwegian Bokmål + Nynorsk) |
## Evaluation Results
### DANSK (primary benchmark, multi-domain)
| Split | PER F1 | ORG F1 | LOC F1 | Micro F1 | Precision | Recall |
|-------|--------|--------|--------|----------|-----------|--------|
| Dev | 88.0 | 85.3 | 90.3 | **87.6** | 86.6 | 88.7 |
| Test | 84.8 | 84.6 | 90.3 | **86.5** | 85.4 | 87.5 |
### DaNE (secondary benchmark, news domain)
| Split | PER F1 | ORG F1 | LOC F1 | Micro F1 | Precision | Recall |
|-------|--------|--------|--------|----------|-----------|--------|
| Dev | 97.5 | 85.1 | 92.9 | **93.0** | 93.2 | 92.9 |
| Test | 94.2 | 79.7 | 87.8 | **87.7** | 88.1 | 87.3 |
### GPI Legal Documents (independent evaluation, Danish legal domain)
Evaluated on 30 human-corrected documents (contracts, invoices, case briefs, client letters):
| Entity | Precision | Recall | Notes |
|--------|-----------|--------|-------|
| PER | 0.76 | 1.00 | Perfect recall; false positives are email addresses misclassified as PER |
| ORG | 0.94 | 0.96 | Near-perfect |
| LOC | 0.52 | 0.51 | Boundary errors (detects the street name but misses the house number); entity detection itself is near-perfect |
LOC score reflects strict span matching. The model consistently detects location entities but predicts shorter spans (e.g., "Gothersgade" instead of "Gothersgade 81"). A post-processing step to extend LOC spans to include adjacent numbers resolves this.
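A minimal sketch of such a post-processing step, assuming entities in the Hugging Face pipeline format (dicts with `start`/`end` character offsets). The function name and the regex are illustrative; the regex here only handles a trailing house number:

```python
import re

def extend_loc_spans(text, entities):
    """Extend LOC entity spans to include an adjacent house number.

    Assumes the Hugging Face NER pipeline output format: dicts with
    'entity_group', 'word', 'start', and 'end' keys. The regex is a
    simple illustration: optional whitespace, digits, optional letter
    suffix (e.g. "81" or "12B").
    """
    extended = []
    for ent in entities:
        if ent.get("entity_group") == "LOC":
            m = re.match(r"\s*\d+[A-Za-z]?", text[ent["end"]:])
            if m:
                new_end = ent["end"] + m.end()
                ent = {**ent, "end": new_end, "word": text[ent["start"]:new_end]}
        extended.append(ent)
    return extended

text = "Anders bor på Gothersgade 81 i København."
ents = [{"entity_group": "LOC", "word": "Gothersgade", "start": 14, "end": 25}]
print(extend_loc_spans(text, ents))  # 'word' becomes "Gothersgade 81"
```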
## Usage
```python
from transformers import pipeline

model_name = "thomasbeste/danish-xlmr-ner-large"
nlp = pipeline("ner", model=model_name, aggregation_strategy="simple")

text = "Anders Jensen fra Danske Bank bor på Vestergade 42 i København."
entities = nlp(text)
for ent in entities:
    print(f"{ent['entity_group']}: {ent['word']} (score: {ent['score']:.3f})")
```
### ONNX Deployment
For production use, export to ONNX INT8 for ~3x CPU speedup:
```bash
pip install optimum[onnxruntime]
# Export to ONNX
optimum-cli export onnx --model thomasbeste/danish-xlmr-ner-large ./model-onnx --task token-classification
# Quantize to INT8
python -c "
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
q = ORTQuantizer.from_pretrained('./model-onnx')
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
q.quantize(save_dir='./model-onnx-int8', quantization_config=qconfig)
"
```
## Label Scheme
IOB2 format with 7 labels:
| ID | Label |
|----|-------|
| 0 | O |
| 1 | B-PER |
| 2 | I-PER |
| 3 | B-ORG |
| 4 | I-ORG |
| 5 | B-LOC |
| 6 | I-LOC |
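The same mapping should be available from the model config as `model.config.id2label`; written out as plain Python for reference:

```python
# IOB2 label mapping from the table above.
ID2LABEL = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
}
LABEL2ID = {label: i for i, label in ID2LABEL.items()}

# Map a sequence of predicted label IDs back to IOB2 tags.
pred_ids = [1, 2, 0, 3, 4]
print([ID2LABEL[i] for i in pred_ids])  # ['B-PER', 'I-PER', 'O', 'B-ORG', 'I-ORG']
```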
## Intended Use
Designed for GDPR-compliant PII detection in Danish enterprise document processing pipelines. Optimized for **recall over precision** — a missed entity (false negative) is a compliance risk, while over-detection (false positive) is safe.
## Limitations
- Optimized for Danish text. May work on other Scandinavian languages (Norwegian, Swedish) but not evaluated.
- LOC boundary detection tends to predict shorter spans than the full address. Post-processing recommended.
- Email addresses are sometimes misclassified as PER. Downstream validation (reject names containing `@`) is recommended.
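The downstream validation suggested above can be as simple as dropping PER predictions whose surface form looks like an email address. A minimal sketch; the `@` check follows the recommendation in this card, and the regex is an illustrative tightening of it:

```python
import re

# Illustrative email-like pattern: any token containing an '@'.
EMAIL_LIKE = re.compile(r"\S+@\S+")

def drop_email_persons(entities):
    """Filter out PER predictions whose text contains an email-like token."""
    return [
        ent for ent in entities
        if not (ent["entity_group"] == "PER" and EMAIL_LIKE.search(ent["word"]))
    ]

ents = [
    {"entity_group": "PER", "word": "Anders Jensen"},
    {"entity_group": "PER", "word": "aj@firma.dk"},
]
print(drop_email_persons(ents))  # keeps only "Anders Jensen"
```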