---
library_name: gliner2
license: mit
base_model: fastino/gliner2-large-v1
datasets:
- ai4data/datause-train
tags:
- ner
- data-mention-extraction
- lora
- gliner2
- development-economics
---
# datause-extraction
|
Fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from
development economics and humanitarian research documents.
|
|
This is the production release of
[rafmacalaba/gliner2-datause-large-v1-deval-synth-v2](https://huggingface.co/rafmacalaba/gliner2-datause-large-v1-deval-synth-v2).
|
|
## Task
|
|
Given a passage of text, the model identifies every data source mentioned and
classifies it across four dimensions:
|
|
| Field | Type | Values |
|---|---|---|
| `mention_name` | Extractive span | Verbatim text from the passage |
| `specificity_tag` | Classification | `named` / `descriptive` / `vague` |
| `typology_tag` | Classification | `survey` / `census` / `administrative` / `database` / `indicator` / `geospatial` / `microdata` / `report` / `other` |
| `is_used` | Classification | `True` / `False` |
| `usage_context` | Classification | `primary` / `supporting` / `background` |
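As a concrete illustration of the schema above, the sketch below builds one plausible output record and checks its labels against the allowed sets. The flat dict layout and the `validate` helper are assumptions for illustration, not the library's native output format; the label values mirror the sample sentence used in the inference code.

```python
# Allowed label sets, copied from the field table above.
ALLOWED = {
    "specificity_tag": {"named", "descriptive", "vague"},
    "typology_tag": {
        "survey", "census", "administrative", "database",
        "indicator", "geospatial", "microdata", "report", "other",
    },
    "is_used": {"True", "False"},
    "usage_context": {"primary", "supporting", "background"},
}

# One hypothetical extracted mention (layout assumed for illustration).
example = {
    "mention_name": "Demographic and Health Survey (DHS) 2020",
    "specificity_tag": "named",
    "typology_tag": "survey",
    "is_used": "True",
    "usage_context": "primary",
}

def validate(record: dict) -> bool:
    """Return True when every classification field carries an allowed label."""
    return all(record.get(field) in labels for field, labels in ALLOWED.items())

print(validate(example))  # True
```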
|
|
## Inference — Two-Pass Hybrid
|
|
This model uses a **two-pass** architecture. A single-pass structured extraction
call will not produce correct results.
|
|
```python
# Install the patched GLiNER2 library first:
# pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror
from gliner2 import GLiNER2
from huggingface_hub import snapshot_download

BASE_MODEL = "fastino/gliner2-large-v1"
ADAPTER_ID = "ai4data/datause-extraction"

extractor = GLiNER2.from_pretrained(BASE_MODEL)
extractor.load_adapter(snapshot_download(ADAPTER_ID))
extractor.eval()

CLASSIFICATION_TASKS = {
    "specificity_tag": ["named", "descriptive", "vague"],
    "typology_tag": [
        "survey", "census", "administrative", "database",
        "indicator", "geospatial", "microdata", "report", "other",
    ],
    "is_used": ["True", "False"],
    "usage_context": ["primary", "supporting", "background"],
}

text = "We use the Demographic and Health Survey (DHS) 2020 as our primary data source."

# Pass 1: extract candidate data-mention spans
res_ent = extractor.extract_entities(
    text, ["data_mention"], threshold=0.3, include_confidence=True
)
spans = (
    res_ent.get("entities", {}).get("data_mention", [])
    if isinstance(res_ent, dict)
    else res_ent
)

# Pass 2: classify each valid span against a ±150-character context window
results = []
for span_data in spans:
    span_text = span_data.get("text", "") if isinstance(span_data, dict) else str(span_data)
    span_conf = span_data.get("confidence", 0.0) if isinstance(span_data, dict) else 1.0
    if len(span_text) < 3:  # skip degenerate spans
        continue
    start = text.find(span_text)
    ctx_start = max(0, start - 150) if start != -1 else 0
    ctx_end = min(len(text), start + len(span_text) + 150) if start != -1 else len(text)
    context_str = f"Mention: {span_text} | Context: {text[ctx_start:ctx_end]}"

    classes = extractor.classify_text(context_str, CLASSIFICATION_TASKS, threshold=0.3)
    mention = {"mention_name": span_text, "confidence": span_conf}
    for task, out in classes.items():
        mention[task] = out[0] if isinstance(out, tuple) and len(out) == 2 else out
    results.append(mention)

print(results)
```
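The context-window construction inside the loop can be factored into a small standalone helper for unit testing. The function name `build_context` is hypothetical (not part of the GLiNER2 API); it just replicates the Pass-2 input string built above.

```python
def build_context(text: str, span_text: str, window: int = 150) -> str:
    """Build the Pass-2 input: the mention plus a +/-`window`-character
    slice of the passage around its first occurrence. Falls back to the
    full passage when the span is not found verbatim."""
    start = text.find(span_text)
    if start == -1:
        ctx_start, ctx_end = 0, len(text)
    else:
        ctx_start = max(0, start - window)
        ctx_end = min(len(text), start + len(span_text) + window)
    return f"Mention: {span_text} | Context: {text[ctx_start:ctx_end]}"

passage = "We use the Demographic and Health Survey (DHS) 2020 as our primary data source."
# The passage is shorter than the window, so the context is the whole passage.
print(build_context(passage, "Demographic and Health Survey (DHS) 2020"))
```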
|
|