astroNLPy-ner

Named entity recognition (NER) for astronomical observation reports (ATels, GCN Circulars, TNS reports). Fine-tuned from adsabs/astroBERT on the Time-Domain Astronomy Corpus (TDAC), covering 27 astrophysical entity types.

This model is the NER component of the astroNLPy package, which also provides LLM-based coreference resolution and relation extraction for celestial objects.

Usage

With the astroNLPy package (recommended):

from astroNLPy.ner import NERModel

ner = NERModel.from_pretrained("atillaalkan/astroNLPy-ner")
tags = ner.predict_text("Swift observed GRS 1747-312 in the X-ray band.")
print(tags)

Or directly with transformers:

from transformers import pipeline

nlp = pipeline("token-classification", model="atillaalkan/astroNLPy-ner",
               aggregation_strategy="simple")
print(nlp("We report a nova in M31 at R = 19.7 mag."))
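With aggregation_strategy="simple", the transformers pipeline returns a list of dicts with the keys entity_group, score, word, start and end. A minimal sketch of grouping that output by entity type follows. The helper name group_entities, the 0.5 score threshold and the sample predictions are illustrative assumptions, not part of astroNLPy or real model output.

```python
from collections import defaultdict


def group_entities(predictions, min_score=0.5):
    """Group aggregated token-classification predictions by entity type.

    `predictions` is expected to look like the output of a transformers
    token-classification pipeline with an aggregation strategy: a list of
    dicts with "entity_group", "score", "word", "start" and "end".
    `min_score` is an arbitrary illustrative cutoff.
    """
    grouped = defaultdict(list)
    for pred in predictions:
        if pred["score"] >= min_score:
            grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


# Hand-written stand-in for pipeline output (invented scores and labels,
# NOT what the model actually predicts for this sentence).
sample = [
    {"entity_group": "CelestialObject", "score": 0.98, "word": "M31", "start": 20, "end": 23},
    {"entity_group": "Wavelength", "score": 0.31, "word": "R", "start": 27, "end": 28},
]

print(group_entities(sample))
```

In practice, `sample` would be replaced by the return value of `nlp(...)` from the snippet above.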

Entity types

CelestialObject, CelestialRegion, CelestialObjectRegion, Telescope, Observatory, Instrument, Survey, Wavelength, Formula, ObservationalTechniques, Citation, Dataset, Database, Archive, Software, URL, Person, Organization, Collaboration, Location, Grant, Proposal, Event, Model, Identifier, Tag, TextGarbage.
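Under IOB2 tagging, each entity type yields a B- (begin) and an I- (inside) label, plus one shared O label for tokens outside any entity. The sketch below builds such a label set from the 27 types listed above. The label ordering here is an assumption for illustration; the mapping actually used by the model is the one stored in its config (model.config.id2label).

```python
ENTITY_TYPES = [
    "CelestialObject", "CelestialRegion", "CelestialObjectRegion",
    "Telescope", "Observatory", "Instrument", "Survey", "Wavelength",
    "Formula", "ObservationalTechniques", "Citation", "Dataset",
    "Database", "Archive", "Software", "URL", "Person", "Organization",
    "Collaboration", "Location", "Grant", "Proposal", "Event", "Model",
    "Identifier", "Tag", "TextGarbage",
]


def build_iob2_labels(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]
    for etype in entity_types:
        labels.append(f"B-{etype}")
        labels.append(f"I-{etype}")
    return labels


labels = build_iob2_labels(ENTITY_TYPES)

# Illustrative mappings; the released model may order its labels differently.
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(ENTITY_TYPES), "entity types ->", len(labels), "IOB2 labels")
```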

Results (v0.1.0)

Evaluated on a single 80/10/10 holdout split (8 test documents), scored with seqeval using IOB2 tags:

Metric	Value
Micro F1	0.52
CelestialObject F1	0.96
Person F1	0.92
Wavelength F1	0.68

Single-split result; the micro-average is depressed by entity types absent from the small test set.
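To make the gap between per-type and micro-averaged scores concrete, here is a simplified pure-Python sketch of span-level scoring on IOB2 tags. It is not seqeval itself (stray I- tags are simply ignored here), and the gold and predicted sequences are invented toy data, not taken from the TDAC test set. It shows one way a perfect per-type F1 can coexist with a much lower micro F1: errors on other types, including a predicted type that never occurs in the gold data, are pooled into the same counts.

```python
from collections import defaultdict


def extract_spans(tags):
    """Return {(type, start, end)} spans from one IOB2 tag sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            spans.add((etype, start, i))
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        else:  # "O" or a stray I- tag: no open span
            start, etype = None, None
    return spans


def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score(gold_seqs, pred_seqs):
    """Return (per-type F1 dict, micro F1) with exact span matching."""
    counts = defaultdict(lambda: [0, 0, 0])  # type -> [tp, fp, fn]
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_spans(gold), extract_spans(pred)
        for etype, _, _ in g & p:
            counts[etype][0] += 1
        for etype, _, _ in p - g:
            counts[etype][1] += 1
        for etype, _, _ in g - p:
            counts[etype][2] += 1
    per_type = {etype: f1(*c) for etype, c in counts.items()}
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    return per_type, f1(tp, fp, fn)


# Toy example: the object is found, the telescope is missed, and a Survey
# is predicted although no Survey exists in the gold annotation.
gold = [["B-CelestialObject", "I-CelestialObject", "O", "B-Telescope", "O"]]
pred = [["B-CelestialObject", "I-CelestialObject", "O", "O", "B-Survey"]]

per_type, micro = score(gold, pred)
print(per_type)
print("micro F1:", micro)
```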

Training

Base model: adsabs/astroBERT
10 epochs, batch size 8, learning rate 2e-5, IOB2 token classification
Corpus: TDAC (74 documents, ~19k tokens)
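For anyone trying to reproduce a comparable setup, the hyperparameters above can be written down as a plain dict. This is a sketch, not the author's training script. The keys are named after fields of transformers.TrainingArguments, and mapping "batch size 8" to a per-device batch size assumes single-device training. All other settings (optimizer, warmup, seed, maximum sequence length) are not stated in this card and are left out.

```python
# Hyperparameters as stated in this model card. Key names follow
# transformers.TrainingArguments; per-device batch size assumes one device.
hparams = {
    "num_train_epochs": 10,
    "per_device_train_batch_size": 8,
    "learning_rate": 2e-5,
}

# Corpus statistics from the card (token count is approximate).
n_documents = 74
n_tokens_approx = 19_000

avg_tokens_per_doc = n_tokens_approx / n_documents
print(f"~{avg_tokens_per_doc:.0f} tokens per document on average")

# With transformers installed, these could be passed on as, for example:
#   TrainingArguments(output_dir="out", **hparams)
# where "out" is a placeholder directory name.
```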

Citation

A publication is forthcoming.

License

MIT
