---
library_name: gliner2
license: mit
base_model: fastino/gliner2-large-v1
datasets:
- ai4data/datause-train
tags:
- ner
- data-mention-extraction
- lora
- gliner2
- development-economics
---
# datause-extraction
Fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from
development economics and humanitarian research documents.
This is the production release of
[rafmacalaba/gliner2-datause-large-v1-deval-synth-v2](https://huggingface.co/rafmacalaba/gliner2-datause-large-v1-deval-synth-v2).
## Task
Given a passage of text, the model extracts every data-source mention and
classifies each one along four dimensions:
| Field | Type | Values |
|---|---|---|
| `mention_name` | Extractive span | Verbatim text from the passage |
| `specificity_tag` | Classification | `named` / `descriptive` / `vague` |
| `typology_tag` | Classification | `survey` / `census` / `administrative` / `database` / `indicator` / `geospatial` / `microdata` / `report` / `other` |
| `is_used` | Classification | `True` / `False` |
| `usage_context` | Classification | `primary` / `supporting` / `background` |
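Put together, each extracted mention is reported as a record with the five fields above. The values in this sketch are invented for illustration, not actual model output:

```python
# Illustrative output record for one extracted mention.
# The field values below are hypothetical examples.
example_mention = {
    "mention_name": "Demographic and Health Survey (DHS) 2020",  # verbatim span
    "specificity_tag": "named",    # named / descriptive / vague
    "typology_tag": "survey",      # one of the nine typology labels
    "is_used": "True",             # whether the document actually uses the data
    "usage_context": "primary",    # primary / supporting / background
}
```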
## Inference — Two-Pass Hybrid
This model uses a **two-pass** architecture: entity extraction first, then
per-span classification over a local context window. A single-pass structured
extraction will not produce correct results.
```python
# Install the patched GLiNER2 library first:
# pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror
from gliner2 import GLiNER2
from huggingface_hub import snapshot_download

BASE_MODEL = "fastino/gliner2-large-v1"
ADAPTER_ID = "ai4data/datause-extraction"

# Load the base model, then the fine-tuned LoRA adapter on top of it
extractor = GLiNER2.from_pretrained(BASE_MODEL)
extractor.load_adapter(snapshot_download(ADAPTER_ID))
extractor.eval()

CLASSIFICATION_TASKS = {
    "specificity_tag": ["named", "descriptive", "vague"],
    "typology_tag": [
        "survey", "census", "administrative", "database",
        "indicator", "geospatial", "microdata", "report", "other",
    ],
    "is_used": ["True", "False"],
    "usage_context": ["primary", "supporting", "background"],
}

text = "We use the Demographic and Health Survey (DHS) 2020 as our primary data source."

# Pass 1: entity extraction
res_ent = extractor.extract_entities(
    text, ["data_mention"], threshold=0.3, include_confidence=True
)
spans = (
    res_ent.get("entities", {}).get("data_mention", [])
    if isinstance(res_ent, dict)
    else res_ent
)

# Pass 2: classify each valid span within a +/-150-character context window
results = []
for span_data in spans:
    span_text = span_data.get("text", "") if isinstance(span_data, dict) else str(span_data)
    span_conf = span_data.get("confidence", 0.0) if isinstance(span_data, dict) else 1.0
    if len(span_text) < 3:
        continue  # skip degenerate spans
    start = text.find(span_text)
    ctx_start = max(0, start - 150) if start != -1 else 0
    ctx_end = min(len(text), start + len(span_text) + 150) if start != -1 else len(text)
    context_str = f"Mention: {span_text} | Context: {text[ctx_start:ctx_end]}"
    classes = extractor.classify_text(context_str, CLASSIFICATION_TASKS, threshold=0.3)
    mention = {"mention_name": span_text, "confidence": span_conf}
    for task, out in classes.items():
        mention[task] = out[0] if isinstance(out, tuple) and len(out) == 2 else out
    results.append(mention)

print(results)
```
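Note that `text.find` in the example above locates only the *first* occurrence of a span. If the same mention can appear several times in a long passage and each occurrence should be classified in its own context, a small helper (a sketch, not part of the released pipeline) can enumerate every occurrence and build a window for each:

```python
def context_windows(text: str, span: str, margin: int = 150):
    """Yield a context window around every occurrence of `span` in `text`."""
    start = text.find(span)
    while start != -1:
        ctx_start = max(0, start - margin)
        ctx_end = min(len(text), start + len(span) + margin)
        yield text[ctx_start:ctx_end]
        # resume the search just past this occurrence
        start = text.find(span, start + 1)
```

Each window can then be passed to `classify_text` with `CLASSIFICATION_TASKS`, exactly as in Pass 2 above.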