mention2meddra-macbert-base

This repository contains the fine-tuned MacBERT cross-encoder used as the second-stage reranker in a Chinese adverse drug reaction (ADR) mention-to-MedDRA normalization workflow. The first stage retrieves candidate Preferred Terms (PTs); this model scores each mention-candidate pair and outputs the probability that the candidate PT is a correct mapping for the mention.

The model weights and tokenizer files in this repository are publicly released under the Apache License 2.0.

Model Details

  • Base model: hfl/chinese-macbert-base
  • Architecture: BertForSequenceClassification
  • Task: binary mention-candidate pair classification
  • Positive label: match
  • Negative label: not_match
  • Maximum sequence length used in the study: 64 tokens
  • Candidate context: PT name, LLT aliases, HLT, HLGT, and SOC
  • Recommended mention-level decoding threshold: 0.3, applied to cross-encoder match probabilities after top-100 BM25 candidate retrieval
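The retrieval and threshold settings above imply mention-level decoding along the following lines. This is a minimal sketch, assuming per-candidate match probabilities already produced by the reranker; the helper name is illustrative and not part of the released code package.

```python
def decode_mention(scored_candidates, threshold=0.3):
    """Mention-level decoding sketch: keep every candidate PT whose
    cross-encoder match probability clears the threshold, ranked by score.

    scored_candidates: dict mapping a candidate PT identifier to P(match)
    as output by the cross-encoder for that mention-candidate pair.
    """
    kept = [(pt, p) for pt, p in scored_candidates.items() if p >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [pt for pt, _ in kept]
```

With the recommended 0.3 threshold, a mention may map to zero, one, or several PTs, which matches the exact-set-match evaluation reported below.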

Training Data Boundary

The model was fine-tuned on an expert-annotated Chinese ADR normalization corpus derived from a regional pharmacovigilance system. This public model repository does not contain real adverse-event records, expert annotation files, licensed MedDRA dictionary files, source database extracts, or downstream signal-detection datasets.

Full study reproduction requires the original source data, licensed terminology resources, the candidate retrieval stage, and the public code package.

Evaluation

Held-out test performance reported in the associated Journal of Biomedical Informatics submission:

Metric                         Value
Pair-level accuracy            0.983664
Pair-level precision           0.962865
Pair-level recall              0.955054
Pair-level F1                  0.958943
Pair-level ROC-AUC             0.996635
Mention-level exact set match  0.895782
Mention-level micro-F1         0.958681
Mention-level top-1 accuracy   0.977978
Mention-level Recall@3         0.983561
Mention-level Recall@5         0.999690

These metrics were obtained within the study corpus and evaluation protocol. External validation is required before applying the model to other regions, institutions, drug classes, MedDRA versions, or operational workflows.

Intended Use

The model is intended for research use as a reranking component in Chinese ADR terminology normalization. It should be used with a candidate generator and licensed terminology resources. It is not a standalone clinical decision system and should not be used for clinical, regulatory, or safety actions without independent validation and appropriate expert review.
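As a reranking component, the model scores each (mention, candidate-context) pair independently. The sketch below shows how such a scorer can be wrapped, assuming the checkpoint is loaded with the standard transformers AutoModelForSequenceClassification / AutoTokenizer classes; the helper names are ours, and the pure softmax helper makes explicit that label index 1 is the match class.

```python
def p_match(logits):
    """Numerically stable softmax over the two logits
    [not_match, match]; returns P(match)."""
    import math
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def score_pair(model, tokenizer, mention, candidate_context, max_length=64):
    """Score one mention-candidate pair with the cross-encoder.

    model / tokenizer: loaded from this repository via transformers.
    mention: the Chinese ADR mention (sequence A).
    candidate_context: the rendered PT context string (sequence B).
    """
    import torch  # local import; p_match above has no heavy dependencies
    enc = tokenizer(mention, candidate_context,
                    truncation=True, max_length=max_length,
                    return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0].tolist()
    return p_match(logits)
```

Pairing the two sequences in a single forward pass (rather than encoding them separately) is what makes this a cross-encoder: the model attends across mention and candidate tokens jointly, at the cost of one forward pass per candidate.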

License

The model weights, configuration files, tokenizer files, and model card in this repository are released under the Apache License 2.0. The license applies to the released artifacts in this repository and does not grant rights to non-distributed source datasets or licensed terminology resources.

Input Format

The model expects a text pair:

  • sequence A: the Chinese ADR mention or raw report phrase
  • sequence B: a rendered candidate PT context containing PT, LLT, HLT, HLGT, and SOC fields

The public code repository contains the template and evaluation utilities:

https://github.com/xumingjun5208/mention2meddra
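A candidate context for sequence B can be rendered along these lines. The field order follows this model card, but the separator and labels here are an assumption for illustration only; the authoritative template ships with the public code repository linked above.

```python
def render_candidate(pt, llt_aliases, hlt, hlgt, soc):
    """Render a candidate PT context (sequence B) from its MedDRA fields.
    The "Label: value | ..." layout is a hypothetical stand-in for the
    template in the public code repository."""
    return (f"PT: {pt} | LLT: {'/'.join(llt_aliases)} | "
            f"HLT: {hlt} | HLGT: {hlgt} | SOC: {soc}")
```

The rendered string and the mention are then passed to the tokenizer as a text pair (sequence A, sequence B) and truncated to the 64-token maximum used in the study.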
