How to use xumingjun/mention2meddra-macbert-base with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="xumingjun/mention2meddra-macbert-base")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xumingjun/mention2meddra-macbert-base")
model = AutoModelForSequenceClassification.from_pretrained("xumingjun/mention2meddra-macbert-base")
```
mention2meddra-macbert-base
This repository contains the fine-tuned MacBERT cross-encoder used as the second-stage reranker in a Chinese adverse drug reaction (ADR) mention-to-MedDRA normalization workflow. The first stage retrieves candidate Preferred Terms (PTs); this model scores each mention-candidate pair and outputs the probability that the candidate PT is a correct mapping for the mention.
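The two-stage flow described above can be sketched as follows. Note that `retrieve_candidates` and `score_pair` are illustrative placeholders for the first-stage retriever and this cross-encoder; neither function name comes from the repository.

```python
def normalize_mention(mention, retrieve_candidates, score_pair,
                      top_k=100, threshold=0.3):
    """Two-stage normalization sketch: retrieve candidate PTs, then rerank.

    retrieve_candidates(mention, top_k) -> list of candidate PT strings
    score_pair(mention, candidate)      -> probability the candidate matches

    Both callables are placeholders standing in for the BM25 retriever and
    the MacBERT cross-encoder in this repository.
    """
    candidates = retrieve_candidates(mention, top_k)
    scored = [(pt, score_pair(mention, pt)) for pt in candidates]
    # Keep every candidate whose match probability clears the threshold,
    # highest-scoring first.
    accepted = [(pt, p) for pt, p in scored if p >= threshold]
    return sorted(accepted, key=lambda x: x[1], reverse=True)
```

With stub functions in place of the real retriever and reranker, the decoding behaviour (top-k retrieval, 0.3 cut-off) can be exercised without loading any model.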
The model weights and tokenizer files in this repository are publicly released under the Apache License 2.0.
Model Details
- Base model: hfl/chinese-macbert-base
- Architecture: BertForSequenceClassification
- Task: binary mention-candidate pair classification
- Positive label: match
- Negative label: not_match
- Maximum sequence length used in the study: 64 tokens
- Candidate context: PT name, LLT aliases, HLT, HLGT, and SOC
- Recommended mention-level decoding threshold: 0.3 after top-100 BM25 retrieval
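Because the classification head emits one logit per label, the match probability used against the 0.3 threshold can be recovered with a softmax over the two logits. A minimal sketch, assuming the positive match label sits at index 1 (verify the ordering in model.config.id2label before relying on it):

```python
import math

def match_probability(logits, match_index=1):
    """Softmax over the two sequence-classification logits.

    `match_index=1` assumes the positive `match` label is the second
    class; check `model.config.id2label` to confirm for this checkpoint.
    """
    m = max(logits)                                 # for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[match_index] / sum(exps)
```

Feeding the raw logits from `BertForSequenceClassification` through this helper yields the probability compared against the recommended 0.3 threshold.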
Training Data Boundary
The model was fine-tuned on an expert-annotated Chinese ADR normalization corpus derived from a regional pharmacovigilance system. This public model repository does not contain real adverse-event records, expert annotation files, licensed MedDRA dictionary files, source database extracts, or downstream signal-detection datasets.
Full study reproduction requires the original source data, licensed terminology resources, the candidate retrieval stage, and the public code package.
Evaluation
Held-out test performance reported for the associated Journal of Biomedical Informatics submission:
| Metric | Value |
|---|---|
| Pair-level accuracy | 0.983664 |
| Pair-level precision | 0.962865 |
| Pair-level recall | 0.955054 |
| Pair-level F1 | 0.958943 |
| Pair-level ROC-AUC | 0.996635 |
| Mention-level exact set match | 0.895782 |
| Mention-level micro-F1 | 0.958681 |
| Mention-level top-1 accuracy | 0.977978 |
| Mention-level Recall@3 | 0.983561 |
| Mention-level Recall@5 | 0.999690 |
These metrics were obtained within the study corpus and evaluation protocol. External validation is required before applying the model to other regions, institutions, drug classes, MedDRA versions, or operational workflows.
Intended Use
The model is intended for research use as a reranking component in Chinese ADR terminology normalization. It should be used with a candidate generator and licensed terminology resources. It is not a standalone clinical decision system and should not be used for clinical, regulatory, or safety actions without independent validation and appropriate expert review.
License
The model weights, configuration files, tokenizer files, and model card in this repository are released under the Apache License 2.0. The license applies to the released artifacts in this repository and does not grant rights to non-distributed source datasets or licensed terminology resources.
Input Format
The model expects a text pair:
- sequence A: the Chinese ADR mention or raw report phrase
- sequence B: a rendered candidate PT context containing PT, LLT, HLT, HLGT, and SOC fields
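The exact rendering template for sequence B ships with the public code package; as a stand-in, a hypothetical renderer might flatten the five fields like this (the field separators and labels below are assumptions, not the study's template):

```python
def render_candidate_context(pt, llt_aliases, hlt, hlgt, soc):
    """Illustrative rendering of a candidate PT context (sequence B).

    The real template is distributed with the public code repository;
    this layout is an assumption for demonstration only.
    """
    return (
        f"PT: {pt}; LLT: {'、'.join(llt_aliases)}; "
        f"HLT: {hlt}; HLGT: {hlgt}; SOC: {soc}"
    )
```

The rendered string would then be passed as the second element of the text pair, with the mention as the first, truncated to the study's 64-token limit.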
The public code repository contains the rendering template and evaluation utilities.