---
language:
- da
license: mit
tags:
- named-entity-recognition
- token-classification
- danish
- ner
- xlm-roberta
- pii-detection
- gdpr
datasets:
- chcaa/dansk-ner
- KennethEnevoldsen/dane_plus
- ltg/norne
pipeline_tag: token-classification
model-index:
- name: danish-xlmr-ner-large
results:
- task:
type: token-classification
name: Named Entity Recognition
dataset:
type: chcaa/dansk-ner
name: DANSK (dev)
metrics:
- type: f1
value: 87.6
name: F1 (micro)
- type: precision
value: 86.6
name: Precision
- type: recall
value: 88.7
name: Recall
- task:
type: token-classification
name: Named Entity Recognition
dataset:
type: chcaa/dansk-ner
name: DANSK (test)
metrics:
- type: f1
value: 86.5
name: F1 (micro)
- type: precision
value: 85.4
name: Precision
- type: recall
value: 87.5
name: Recall
- task:
type: token-classification
name: Named Entity Recognition
dataset:
type: KennethEnevoldsen/dane_plus
name: DaNE (dev)
metrics:
- type: f1
value: 93.0
name: F1 (micro)
- type: precision
value: 93.2
name: Precision
- type: recall
value: 92.9
name: Recall
- task:
type: token-classification
name: Named Entity Recognition
dataset:
type: KennethEnevoldsen/dane_plus
name: DaNE (test)
metrics:
- type: f1
value: 87.7
name: F1 (micro)
- type: precision
value: 88.1
name: Precision
- type: recall
value: 87.3
name: Recall
---
# Danish XLM-R NER Large (Two-Stage)
A Danish Named Entity Recognition model based on `xlm-roberta-large` (560M parameters), fine-tuned in two stages for high-recall PII detection in Danish text.
## Model Description
This model detects three entity types relevant for GDPR-compliant PII processing:
- **PER** - Person names
- **ORG** - Organizations (companies, institutions, government bodies)
- **LOC** - Locations (addresses, cities, countries)
MISC is intentionally excluded to reduce noise and focus on actionable PII entities.
### Training
**Two-stage fine-tuning approach:**
1. **Stage 1** (broad NER): DANSK + DaNE + NorNE, 10 epochs, LR 2e-5
2. **Stage 2** (domain adaptation): DANSK-only, 1 epoch, LR 5e-6
This approach achieves the best balance between multi-domain generalization and Danish-specific performance.
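For quick reference, the two stages can be written down as plain configuration dicts. This sketch contains only the hyperparameters stated above; batch size, scheduler, and other training settings are not specified in this card and are omitted:

```python
# Hyperparameters for the two fine-tuning stages described above.
# Only the datasets, epochs, and learning rates come from this card;
# all other settings (batch size, warmup, etc.) are unspecified.
STAGES = [
    {
        "name": "stage1_broad_ner",
        "datasets": ["DANSK", "DaNE", "NorNE"],
        "num_train_epochs": 10,
        "learning_rate": 2e-5,
    },
    {
        "name": "stage2_domain_adaptation",
        "datasets": ["DANSK"],
        "num_train_epochs": 1,
        "learning_rate": 5e-6,
    },
]

for stage in STAGES:
    print(f"{stage['name']}: {stage['num_train_epochs']} epoch(s) "
          f"at LR {stage['learning_rate']} on {' + '.join(stage['datasets'])}")
```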
### Datasets
| Dataset | Role | Size | Domains |
|---------|------|------|---------|
| [DANSK](https://huggingface.co/datasets/chcaa/dansk-ner) | Primary | 11.7K train | Web, News, Wiki, Legal, Dannet, Conversation, Social Media |
| [DaNE](https://huggingface.co/datasets/KennethEnevoldsen/dane_plus) | Supplementary | 4.4K train | News |
| [NorNE](https://huggingface.co/datasets/ltg/norne) | Stage 1 only | ~20K train | News (Norwegian Bokmål + Nynorsk) |
## Evaluation Results
### DANSK (primary benchmark, multi-domain)
| Split | PER F1 | ORG F1 | LOC F1 | Micro F1 | Precision | Recall |
|-------|--------|--------|--------|----------|-----------|--------|
| Dev | 88.0 | 85.3 | 90.3 | **87.6** | 86.6 | 88.7 |
| Test | 84.8 | 84.6 | 90.3 | **86.5** | 85.4 | 87.5 |
### DaNE (secondary benchmark, news domain)
| Split | PER F1 | ORG F1 | LOC F1 | Micro F1 | Precision | Recall |
|-------|--------|--------|--------|----------|-----------|--------|
| Dev | 97.5 | 85.1 | 92.9 | **93.0** | 93.2 | 92.9 |
| Test | 94.2 | 79.7 | 87.8 | **87.7** | 88.1 | 87.3 |
### GPI Legal Documents (independent evaluation, Danish legal domain)
Evaluated on 30 human-corrected documents (contracts, invoices, case briefs, client letters):
| Entity | Precision | Recall | Notes |
|--------|-----------|--------|-------|
| PER | 0.76 | 1.00 | Perfect recall; false positives are email addresses misclassified as PER |
| ORG | 0.94 | 0.96 | Near-perfect |
| LOC | 0.52 | 0.51 | Boundary errors (detects the street name but misses the house number); entity detection itself is near-perfect |
LOC score reflects strict span matching. The model consistently detects location entities but predicts shorter spans (e.g., "Gothersgade" instead of "Gothersgade 81"). A post-processing step to extend LOC spans to include adjacent numbers resolves this.
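A minimal sketch of such a post-processing step, assuming entities in the Hugging Face pipeline format (dicts with `start`/`end` character offsets). The function name and the regex are illustrative; the regex here only handles a trailing house number:

```python
import re

def extend_loc_spans(text, entities):
    """Extend LOC entity spans to include an adjacent house number.

    Assumes the Hugging Face NER pipeline output format: dicts with
    'entity_group', 'word', 'start', and 'end' keys. The regex is a
    simple illustration: optional whitespace, digits, optional letter
    suffix (e.g. "81" or "12B").
    """
    extended = []
    for ent in entities:
        if ent.get("entity_group") == "LOC":
            m = re.match(r"\s*\d+[A-Za-z]?", text[ent["end"]:])
            if m:
                new_end = ent["end"] + m.end()
                ent = {**ent, "end": new_end, "word": text[ent["start"]:new_end]}
        extended.append(ent)
    return extended

text = "Anders bor på Gothersgade 81 i København."
ents = [{"entity_group": "LOC", "word": "Gothersgade", "start": 14, "end": 25}]
print(extend_loc_spans(text, ents))  # 'word' becomes "Gothersgade 81"
```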
## Usage
```python
from transformers import pipeline

model_name = "thomasbeste/danish-xlmr-ner-large"
nlp = pipeline("ner", model=model_name, aggregation_strategy="simple")

text = "Anders Jensen fra Danske Bank bor på Vestergade 42 i København."
entities = nlp(text)
for ent in entities:
    print(f"{ent['entity_group']}: {ent['word']} (score: {ent['score']:.3f})")
```
### ONNX Deployment
For production use, export to ONNX INT8 for ~3x CPU speedup:
```bash
pip install optimum[onnxruntime]
# Export to ONNX
optimum-cli export onnx --model thomasbeste/danish-xlmr-ner-large ./model-onnx --task token-classification
# Quantize to INT8
python -c "
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
q = ORTQuantizer.from_pretrained('./model-onnx')
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
q.quantize(save_dir='./model-onnx-int8', quantization_config=qconfig)
"
```
## Label Scheme
IOB2 format with 7 labels:
| ID | Label |
|----|-------|
| 0 | O |
| 1 | B-PER |
| 2 | I-PER |
| 3 | B-ORG |
| 4 | I-ORG |
| 5 | B-LOC |
| 6 | I-LOC |
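The same mapping should be available from the model config as `model.config.id2label`; written out as plain Python for reference:

```python
# IOB2 label mapping from the table above.
ID2LABEL = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
}
LABEL2ID = {label: i for i, label in ID2LABEL.items()}

# Map a sequence of predicted label IDs back to IOB2 tags.
pred_ids = [1, 2, 0, 3, 4]
print([ID2LABEL[i] for i in pred_ids])  # ['B-PER', 'I-PER', 'O', 'B-ORG', 'I-ORG']
```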
## Intended Use
Designed for GDPR-compliant PII detection in Danish enterprise document processing pipelines. Optimized for **recall over precision** — a missed entity (false negative) is a compliance risk, while over-detection (false positive) is safe.
## Limitations
- Optimized for Danish text. May work on other Scandinavian languages (Norwegian, Swedish) but not evaluated.
- LOC boundary detection tends to predict shorter spans than the full address. Post-processing recommended.
- Email addresses are sometimes misclassified as PER. Downstream validation (reject names containing `@`) is recommended.
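The downstream validation suggested above can be as simple as dropping PER predictions whose surface form looks like an email address. A minimal sketch; the `@` check follows the recommendation in this card, and the regex is an illustrative tightening of it:

```python
import re

# Illustrative email-like pattern: any token containing an '@'.
EMAIL_LIKE = re.compile(r"\S+@\S+")

def drop_email_persons(entities):
    """Filter out PER predictions whose text contains an email-like token."""
    return [
        ent for ent in entities
        if not (ent["entity_group"] == "PER" and EMAIL_LIKE.search(ent["word"]))
    ]

ents = [
    {"entity_group": "PER", "word": "Anders Jensen"},
    {"entity_group": "PER", "word": "aj@firma.dk"},
]
print(drop_email_persons(ents))  # keeps only "Anders Jensen"
```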