mention2meddra-macbert-base

This repository contains the fine-tuned MacBERT cross-encoder used as the second-stage reranker in a Chinese adverse drug reaction (ADR) mention-to-MedDRA normalization workflow. The first stage retrieves candidate Preferred Terms (PTs); this model scores each mention-candidate pair and outputs the probability that the candidate PT is a correct mapping for the mention.

The model weights and tokenizer files in this repository are publicly released under the Apache License 2.0.

Model Details

  • Base model: hfl/chinese-macbert-base
  • Architecture: BertForSequenceClassification
  • Task: binary mention-candidate pair classification
  • Positive label: match
  • Negative label: not_match
  • Maximum sequence length used in the study: 64 tokens
  • Candidate context: PT name, LLT aliases, HLT, HLGT, and SOC
  • Recommended mention-level decoding threshold: 0.3, applied to cross-encoder match probabilities after top-100 BM25 candidate retrieval
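The retrieval and threshold settings above imply mention-level decoding along the following lines. This is a minimal sketch, assuming per-candidate match probabilities already produced by the reranker; the helper name is illustrative and not part of the released code package.

```python
def decode_mention(scored_candidates, threshold=0.3):
    """Mention-level decoding sketch: keep every candidate PT whose
    cross-encoder match probability clears the threshold, ranked by score.

    scored_candidates: dict mapping a candidate PT identifier to P(match)
    as output by the cross-encoder for that mention-candidate pair.
    """
    kept = [(pt, p) for pt, p in scored_candidates.items() if p >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [pt for pt, _ in kept]
```

With the recommended 0.3 threshold, a mention may map to zero, one, or several PTs, which matches the exact-set-match evaluation reported below.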

Training Data Boundary

The model was fine-tuned on an expert-annotated Chinese ADR normalization corpus derived from a regional pharmacovigilance system. This public model repository does not contain real adverse-event records, expert annotation files, licensed MedDRA dictionary files, source database extracts, or downstream signal-detection datasets.

Full study reproduction requires the original source data, licensed terminology resources, the candidate retrieval stage, and the public code package.

Evaluation

Held-out test performance reported in the associated Journal of Biomedical Informatics submission:

Metric                         Value
Pair-level accuracy            0.983664
Pair-level precision           0.962865
Pair-level recall              0.955054
Pair-level F1                  0.958943
Pair-level ROC-AUC             0.996635
Mention-level exact set match  0.895782
Mention-level micro-F1         0.958681
Mention-level top-1 accuracy   0.977978
Mention-level Recall@3         0.983561
Mention-level Recall@5         0.999690

These metrics were obtained within the study corpus and evaluation protocol. External validation is required before applying the model to other regions, institutions, drug classes, MedDRA versions, or operational workflows.

Intended Use

The model is intended for research use as a reranking component in Chinese ADR terminology normalization. It should be used with a candidate generator and licensed terminology resources. It is not a standalone clinical decision system and should not be used for clinical, regulatory, or safety actions without independent validation and appropriate expert review.
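As a reranking component, the model scores each (mention, candidate-context) pair independently. The sketch below shows how such a scorer can be wrapped, assuming the checkpoint is loaded with the standard transformers AutoModelForSequenceClassification / AutoTokenizer classes; the helper names are ours, and the pure softmax helper makes explicit that label index 1 is the match class.

```python
def p_match(logits):
    """Numerically stable softmax over the two logits
    [not_match, match]; returns P(match)."""
    import math
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def score_pair(model, tokenizer, mention, candidate_context, max_length=64):
    """Score one mention-candidate pair with the cross-encoder.

    model / tokenizer: loaded from this repository via transformers.
    mention: the Chinese ADR mention (sequence A).
    candidate_context: the rendered PT context string (sequence B).
    """
    import torch  # local import; p_match above has no heavy dependencies
    enc = tokenizer(mention, candidate_context,
                    truncation=True, max_length=max_length,
                    return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0].tolist()
    return p_match(logits)
```

Pairing the two sequences in a single forward pass (rather than encoding them separately) is what makes this a cross-encoder: the model attends across mention and candidate tokens jointly, at the cost of one forward pass per candidate.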

License

The model weights, configuration files, tokenizer files, and model card in this repository are released under the Apache License 2.0. The license applies to the released artifacts in this repository and does not grant rights to non-distributed source datasets or licensed terminology resources.

Input Format

The model expects a text pair:

  • sequence A: the Chinese ADR mention or raw report phrase
  • sequence B: a rendered candidate PT context containing PT, LLT, HLT, HLGT, and SOC fields

The public code repository contains the template and evaluation utilities:

https://github.com/xumingjun5208/mention2meddra
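A candidate context for sequence B can be rendered along these lines. The field order follows this model card, but the separator and labels here are an assumption for illustration only; the authoritative template ships with the public code repository linked above.

```python
def render_candidate(pt, llt_aliases, hlt, hlgt, soc):
    """Render a candidate PT context (sequence B) from its MedDRA fields.
    The "Label: value | ..." layout is a hypothetical stand-in for the
    template in the public code repository."""
    return (f"PT: {pt} | LLT: {'/'.join(llt_aliases)} | "
            f"HLT: {hlt} | HLGT: {hlgt} | SOC: {soc}")
```

The rendered string and the mention are then passed to the tokenizer as a text pair (sequence A, sequence B) and truncated to the 64-token maximum used in the study.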
