---
library_name: gliner2
license: mit
base_model: fastino/gliner2-large-v1
datasets:
- ai4data/datause-train
tags:
- ner
- data-mention-extraction
- lora
- gliner2
- development-economics
---
# datause-extraction
Fine-tuned GLiNER2 LoRA adapter for extracting structured data mentions from
development economics and humanitarian research documents.
This is the production release of
[rafmacalaba/gliner2-datause-large-v1-deval-synth-v2](https://huggingface.co/rafmacalaba/gliner2-datause-large-v1-deval-synth-v2).
## Task
Given a passage of text, the model extracts every data-source mention and
classifies each one along four dimensions:
| Field | Type | Values |
|---|---|---|
| `mention_name` | Extractive span | Verbatim text from the passage |
| `specificity_tag` | Classification | `named` / `descriptive` / `vague` |
| `typology_tag` | Classification | `survey` / `census` / `administrative` / `database` / `indicator` / `geospatial` / `microdata` / `report` / `other` |
| `is_used` | Classification | `True` / `False` |
| `usage_context` | Classification | `primary` / `supporting` / `background` |
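Put together, each extracted mention is reported as a record with the five fields above. The values in this sketch are invented for illustration, not actual model output:

```python
# Illustrative output record for one extracted mention.
# The field values below are hypothetical examples.
example_mention = {
    "mention_name": "Demographic and Health Survey (DHS) 2020",  # verbatim span
    "specificity_tag": "named",    # named / descriptive / vague
    "typology_tag": "survey",      # one of the nine typology labels
    "is_used": "True",             # whether the document actually uses the data
    "usage_context": "primary",    # primary / supporting / background
}
```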
## Inference — Two-Pass Hybrid
This model uses a **two-pass** architecture: entity extraction first, then
per-span classification over a local context window. A single-pass structured
extraction will not produce correct results.
```python
# Install the patched GLiNER2 library first:
# pip install git+https://github.com/rafmacalaba/GLiNER2.git@feat/main-mirror
from gliner2 import GLiNER2
from huggingface_hub import snapshot_download

BASE_MODEL = "fastino/gliner2-large-v1"
ADAPTER_ID = "ai4data/datause-extraction"

# Load the base model, then the fine-tuned LoRA adapter on top of it
extractor = GLiNER2.from_pretrained(BASE_MODEL)
extractor.load_adapter(snapshot_download(ADAPTER_ID))
extractor.eval()

CLASSIFICATION_TASKS = {
    "specificity_tag": ["named", "descriptive", "vague"],
    "typology_tag": [
        "survey", "census", "administrative", "database",
        "indicator", "geospatial", "microdata", "report", "other",
    ],
    "is_used": ["True", "False"],
    "usage_context": ["primary", "supporting", "background"],
}

text = "We use the Demographic and Health Survey (DHS) 2020 as our primary data source."

# Pass 1: entity extraction
res_ent = extractor.extract_entities(
    text, ["data_mention"], threshold=0.3, include_confidence=True
)
spans = (
    res_ent.get("entities", {}).get("data_mention", [])
    if isinstance(res_ent, dict)
    else res_ent
)

# Pass 2: classify each valid span within a +/-150-character context window
results = []
for span_data in spans:
    span_text = span_data.get("text", "") if isinstance(span_data, dict) else str(span_data)
    span_conf = span_data.get("confidence", 0.0) if isinstance(span_data, dict) else 1.0
    if len(span_text) < 3:
        continue  # skip degenerate spans
    start = text.find(span_text)
    ctx_start = max(0, start - 150) if start != -1 else 0
    ctx_end = min(len(text), start + len(span_text) + 150) if start != -1 else len(text)
    context_str = f"Mention: {span_text} | Context: {text[ctx_start:ctx_end]}"
    classes = extractor.classify_text(context_str, CLASSIFICATION_TASKS, threshold=0.3)
    mention = {"mention_name": span_text, "confidence": span_conf}
    for task, out in classes.items():
        mention[task] = out[0] if isinstance(out, tuple) and len(out) == 2 else out
    results.append(mention)

print(results)
```
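Note that `text.find` in the example above locates only the *first* occurrence of a span. If the same mention can appear several times in a long passage and each occurrence should be classified in its own context, a small helper (a sketch, not part of the released pipeline) can enumerate every occurrence and build a window for each:

```python
def context_windows(text: str, span: str, margin: int = 150):
    """Yield a context window around every occurrence of `span` in `text`."""
    start = text.find(span)
    while start != -1:
        ctx_start = max(0, start - margin)
        ctx_end = min(len(text), start + len(span) + margin)
        yield text[ctx_start:ctx_end]
        # resume the search just past this occurrence
        start = text.find(span, start + 1)
```

Each window can then be passed to `classify_text` with `CLASSIFICATION_TASKS`, exactly as in Pass 2 above.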