BERT-ESI-Triage v49 — Emergency Severity Index Decision-Support Model

A multi-head BiomedBERT model that predicts Emergency Severity Index (ESI) acuity 1–5 from emergency-department triage notes. Designed as a decision-support tool for triage nurses, not as an autonomous diagnostic system.

Model card status: Pre-release research artifact. Independent clinical validation has not been performed. Not approved for clinical deployment.

Critical safety disclosures

Read this section in full before any evaluation or use.

1. Documented racial bias signal (sickle-cell disease)

A controlled probe found a ~30 percentage-point ESI 1–2 catch-rate gap on sickle-cell disease (SCD) presentations across racial mention patterns (n=4 templates per group; small-n caveat). Subsequent paraphrase testing on a varied phrasing set reduced the apparent gap to ~3–7 pp. Interpretation: SCD recognition appears phrasing-sensitive in a way that interacts with race-coded language. Implication: Any pre-deployment fairness audit must include race-stratified SCD evaluation; we have not ruled out a real but smaller bias.

2. Documented socioeconomic-status bias

"Homeless" and other unhoused-status mentions correlated with 9–36 pp under-triage on presentations matched for clinical severity. The model has not learned to ignore SES-coded vocabulary. Implication: Site-level monitoring of triage decisions stratified by SES proxy variables is required.

3. Gender bias on atypical presentations

On atypical-presentation MI (fatigue, abdominal pain) the gender gap was ~3 pp under-triage for women, consistent with the published EM literature on under-recognition of female ACS. Implication: Atypical MI recognition remains incomplete.
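The stratified evaluation that disclosures 1 to 3 call for could be sketched as follows. This is a minimal illustration, not the authors' monitoring code. The record fields (`group`, `predicted_esi`, `reference_esi`), the function names, and the toy records are all hypothetical. Under-triage is taken here to mean a predicted ESI that is numerically higher (less acute) than the reference label.

```python
from collections import defaultdict

def under_triage_rates(records):
    """Return {group: fraction of records predicted less acute than the reference}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [under_triaged, total]
    for r in records:
        tally = counts[r["group"]]
        tally[1] += 1
        # A higher ESI number means lower acuity, so predicted > reference is under-triage.
        if r["predicted_esi"] > r["reference_esi"]:
            tally[0] += 1
    return {g: under / total for g, (under, total) in counts.items()}

def gap_pp(rates, group, baseline):
    """Under-triage gap between two groups, in percentage points."""
    return 100.0 * (rates[group] - rates[baseline])

# Toy records (hypothetical field names and values, for illustration only).
records = [
    {"group": "unhoused_mention", "predicted_esi": 3, "reference_esi": 2},
    {"group": "unhoused_mention", "predicted_esi": 3, "reference_esi": 2},
    {"group": "unhoused_mention", "predicted_esi": 2, "reference_esi": 2},
    {"group": "unhoused_mention", "predicted_esi": 1, "reference_esi": 1},
    {"group": "no_mention", "predicted_esi": 3, "reference_esi": 2},
    {"group": "no_mention", "predicted_esi": 2, "reference_esi": 2},
    {"group": "no_mention", "predicted_esi": 2, "reference_esi": 2},
    {"group": "no_mention", "predicted_esi": 1, "reference_esi": 1},
]
rates = under_triage_rates(records)
gap = gap_pp(rates, "unhoused_mention", "no_mention")
```

A real audit would match presentations on clinical severity before comparing groups, as the probes described above did, and would report confidence intervals given the small-n caveat.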

4. Inter-site label variance (a ceiling, not a bug)

Triage labels disagree between sites by up to 58 pp on identical-content chief complaints (e.g. chest pain ESI 1–2: MIMIC Boston 82% vs Stanford MC-MED 24%). The model is calibrated against MIMIC-IV-ED; predictions on other sites should be expected to differ by similar margins. Implication: Per-site calibration is required before deployment at any new site.

5. Pediatric under-representation

Pediatric records are 0.002% of training data. Pediatric ESI 5 recall lower bound: 0%. Implication: This model is not intended for pediatric triage. Adult patients only.

6. Format-as-signal failure

Controlled experiments demonstrate that telegraphic format alone shifts predictions toward ESI 1 by 71% in adversarial pair tests, even when clinical content is matched. Adding "mild" or "well-appearing" to a telegraphic note can increase ESI 1 confidence. Implication: Format-sensitive over-triage is a known failure mode; an ensemble with the deterministic ESI-handbook engine is recommended for production.
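One plausible way to run a matched-pair format probe of this kind is sketched below. It is an assumption about the test design, not the authors' harness: `format_shift_rate` and `toy_predict_esi` are hypothetical names, and the toy predictor merely imitates a format-sensitive model so the example runs without the real weights.

```python
def format_shift_rate(pairs, predict_esi):
    """Fraction of (narrative, telegraphic) pairs where the telegraphic
    version is predicted more acute (numerically lower ESI)."""
    if not pairs:
        raise ValueError("no pairs supplied")
    shifted = sum(
        1 for narrative, telegraphic in pairs
        if predict_esi(telegraphic) < predict_esi(narrative)
    )
    return shifted / len(pairs)

def toy_predict_esi(note):
    """Stand-in for the model (illustrative only): treats short notes as more acute."""
    return 1 if len(note.split()) <= 4 else 3

# Each pair carries the same clinical content in two formats.
pairs = [
    ("Patient reports mild ankle pain after a fall and is walking normally.",
     "ankle pain s/p fall"),
    ("Patient is well-appearing with a sore throat for two days.",
     "sore throat x2d"),
]
rate = format_shift_rate(pairs, toy_predict_esi)
```

With a real model, `predict_esi` would wrap tokenization, the forward pass, and calibration; a rate well above zero on content-matched pairs indicates that format is acting as a signal.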

Reported holdout performance

Results are reported per corpus, after per-dialect calibration: an ESI 1 logit bias of +1.0 on compact-dialect inputs and 0 on narrative-dialect inputs (validated Bayes-optimal cost-sensitive recipe; Elkan, 2001).

Eval corpus	n	Overall exact	ESI 1 recall
MIMIC-IV-ED holdout	19,876	54.7%	84.6%
MC-MED clean	11,521	58.9%	89.0%
MIETIC clean	1,200	83.5%	86.7%
Lukina v3	1,026	49.3%	94.3%

Headline framing — read with caution. MIMIC overall exact 54.7% reflects a recall-optimized calibration; F1 max is at bias = 0 (F1 ≈ 58.7%). The +1.0 bias trades 6 pp exact for 25–30 pp ESI 1 recall, a deliberate safety choice (see "Calibration rationale" below).

Per-class recall and ECE

Macro per-class recall is provided in the associated metrics JSON. Calibration ECE per corpus: MC-MED 0.059 (best), pediatric/anaphylaxis 0.19–0.25 (poor). The +1.0 bias does not improve probability quality; applications requiring calibrated probabilities (not just argmax) should apply per-dialect temperature scaling.
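For applications that need calibrated probabilities, the two quantities mentioned above can be sketched as follows. This is a minimal illustration under assumptions: ECE is computed with equal-width confidence bins (the card does not state the binning used), and the function names and the temperature value are hypothetical. In practice the temperature would be fitted per dialect on held-out data.

```python
import math

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over the top-class confidence."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    bins = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum_conf, n_correct, count]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx][0] += conf
        bins[idx][1] += int(ok)
        bins[idx][2] += 1
    n = len(confidences)
    ece = 0.0
    for sum_conf, n_correct, count in bins:
        if count:
            # Weighted gap between accuracy and mean confidence in this bin.
            ece += (count / n) * abs(n_correct / count - sum_conf / count)
    return ece

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; temperature > 1 softens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
softened = softmax_with_temperature([2.0, 0.0, 0.0, 0.0, 0.0], temperature=2.0)
```

Temperature scaling leaves the argmax unchanged, so it can be applied alongside the logit bias without altering which ESI level is selected.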

Architecture

19-head BiomedBERT (BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext encoder, 110M parameters):

Primary inference head: esi_head (5-way softmax over ESI 1–5)
Display heads (3): symptom (176-dim multi-label), resource (12-dim), resource_count_bucket (3-way)
Perception heads (8): vitals, flags (3-dim with rare-positive pos_weight), NER (21 BIO tags), pain, age, arrival_mode, gender, medrecon
Safety heads (2): airway, resuscitation (rare-positive, pos_weight 2500 / 200 respectively)
Auxiliary heads (5): gestalt, disposition, syndrome, history visits/admits, last-admission-diagnosis

The model was trained jointly on all 19 heads with a weighted loss (esi=2.0, symptom=1.5, others 0.1–1.0), label smoothing 0.05 on the ESI head, and per-symptom positive weights from the v6 loss-weights configuration. Training corpus is documented under "Training data" below.
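The weighted multi-task objective described above can be illustrated with a small, framework-free sketch. It is not the training code: the function names are hypothetical, only the `esi` and `symptom` weights and the 0.05 smoothing value come from this card, and the default weight and per-head loss values in the usage are placeholders.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def smoothed_cross_entropy(logits, target, smoothing=0.05):
    """Cross-entropy against a label-smoothed target distribution."""
    k = len(logits)
    loss = 0.0
    for i, log_p in enumerate(log_softmax(logits)):
        # Target mass: smoothing/k on every class, plus (1 - smoothing) on the true class.
        q = smoothing / k + ((1.0 - smoothing) if i == target else 0.0)
        loss -= q * log_p
    return loss

def total_loss(per_head_losses, weights, default_weight):
    """Weighted sum of per-head losses."""
    return sum(weights.get(head, default_weight) * value
               for head, value in per_head_losses.items())

# Weights stated in the card; the other heads use 0.1-1.0 (exact values not given here).
HEAD_WEIGHTS = {"esi": 2.0, "symptom": 1.5}

esi_loss = smoothed_cross_entropy([2.0, 0.5, 0.1, -1.0, -1.5], target=0)
# "symptom" and "airway" losses below are made-up numbers for illustration.
losses = {"esi": esi_loss, "symptom": 0.30, "airway": 0.02}
loss = total_loss(losses, HEAD_WEIGHTS, default_weight=0.5)  # 0.5 is a placeholder
```

The multi-label heads would use a binary cross-entropy with positive-class weights rather than the softmax loss shown for the ESI head.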

Calibration rationale (per-dialect logit bias)

Production inference applies a +1.0 logit bias to the ESI 1 class on compact dialect inputs and 0 on narrative dialect inputs. This is a derived Bayes-optimal cost-sensitive decision rule (Elkan, 2001):

bias_k = log(C_FN_k / C_FP_k)

A +1.0 bias on ESI 1 corresponds to a cost ratio of e^1.0 ≈ 2.7 for false-negative (missed critical) versus false-positive (over-triage) errors, a clinical trade-off appropriate for triage decision support. This is not an inference patch: it is a documented production calibration of the softmax decision boundary, applied post-training. Inputs routed to the narrative dialect receive a bias of 0, so their predicted class and probability ordering are unchanged.

Per-dialect application avoids the regression observed when applying a uniform bias to in-distribution narrative inputs (which do not benefit from the recall lift).
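The per-dialect rule can be sketched in a few lines. This is an illustration under assumptions, not the production code: the function names are hypothetical, the logits are made up, and the `dialect` label is assumed to come from an upstream router whose logic this card does not describe.

```python
import math

# Bias values stated in this card, applied to the ESI 1 logit only.
ESI1_BIAS_BY_DIALECT = {"compact": 1.0, "narrative": 0.0}

def cost_ratio_to_bias(cost_fn, cost_fp):
    """bias = log(C_FN / C_FP), the cost-sensitive rule quoted above."""
    return math.log(cost_fn / cost_fp)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def calibrated_esi(logits, dialect):
    """Return (predicted ESI level, probabilities); logits are ordered ESI 1..5."""
    biased = list(logits)
    biased[0] += ESI1_BIAS_BY_DIALECT[dialect]
    probs = softmax(biased)
    esi = max(range(len(probs)), key=probs.__getitem__) + 1
    return esi, probs

logits = [1.0, 1.6, 0.2, -1.0, -2.0]  # made-up logits where ESI 2 narrowly leads
narrative_esi, _ = calibrated_esi(logits, "narrative")
compact_esi, _ = calibrated_esi(logits, "compact")
implied_ratio = math.exp(ESI1_BIAS_BY_DIALECT["compact"])  # about 2.718
```

In this toy case the narrative input stays at ESI 2 while the compact input moves to ESI 1, which is the recall-for-exactness trade described under "Reported holdout performance".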

Recommended inference stack

For safety-critical triage decision support, the recommended stack is three-layer:

BERT v49 with per-dialect calibration (this model)
min(BERT, deterministic ESI-handbook engine) ensemble: the final level is the lower ESI number (higher acuity) of the two. The engine reproduces the AHRQ ESI v4 handbook Steps A/B1/B2/B3/C/D deterministically and catches concept-grounded acute presentations (e.g. cardiac arrest, septic shock) that BERT alone has been observed to misclassify (see the "Format-as-signal failure" disclosure above).
Site-level fairness monitoring stratified by SES, race, and gender proxies (see disclosures 1–3 above).

The deterministic engine is provided in the training repository for reference but is not bundled with this model artifact.
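The ensemble step in the second layer reduces to taking the more acute of two levels. A minimal sketch, assuming both components emit an integer ESI level from 1 to 5 (`ensemble_esi` is a hypothetical name, and the engine itself is not shown):

```python
def ensemble_esi(bert_esi, engine_esi):
    """Return the more acute (numerically lower) of the two ESI levels."""
    for name, level in (("bert_esi", bert_esi), ("engine_esi", engine_esi)):
        if level not in (1, 2, 3, 4, 5):
            raise ValueError(f"{name} must be an integer ESI level 1-5, got {level!r}")
    return min(bert_esi, engine_esi)

# The engine flags a resuscitation-level presentation that BERT scored as ESI 3.
final_level = ensemble_esi(bert_esi=3, engine_esi=1)
```

Because the rule only ever moves a prediction toward higher acuity, it can reduce missed critical cases but cannot correct over-triage from either component.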

Inputs

The model expects a single English-language text input — the chief complaint plus any structured vital signs, history, or arrival mode, serialized to a string. Maximum input length is 512 BiomedBERT tokens (longer inputs are truncated; chief complaint should appear in the first 500 characters).

Outputs

Primary: 5-way softmax over ESI 1–5 (after calibration). Auxiliary: 18 additional heads predicting symptoms, resources, vitals, flags, etc. (multi-label or class as appropriate per head).

Known limitations

Adult-only. Pediatric ESI 5 recall lower bound is 0%; pediatric records are 0.002% of training. Do not use for pediatric triage.
Vital-sparse robustness. ESI 1 recall on records with missing vitals is materially lower than the headline number.
Single-rater training labels. Triage labels are subject to known inter-rater variance (kappa 0.5–0.7); multi-rater validation has not been performed at the training corpus scale.
Site-specific calibration. The model is calibrated against MIMIC-IV-ED dialect distribution. Deployment at a new site requires per-site bias re-tuning.
Synthetic ESI 5 narratives. The ESI 5 narrative dialect is currently synthesized from MIMIC paraphrase patterns; real-world ESI 5 narrative diversity (refills, dressing changes, wellness checks) is under-represented.
Bias-sensitive demographics. Race, SES, and gender disclosures above apply; this model has not been fairness-audited at deployment sites.

Recommended use cases

Decision-support overlay for triage nurses with full transparency (display predicted symptoms, resources, vitals — all 19 heads)
Quality-assurance cross-check on retrospective ED triage data
Research benchmark for ESI prediction methodology

Use cases NOT recommended

Autonomous triage decisions without a clinician in the loop
Pediatric triage
Deployment at any site without prior per-site calibration audit
Decisions involving SES, race, or gender-sensitive populations without site-level fairness monitoring
Any deployment without the recommended three-layer inference stack (BERT + engine + fairness monitoring)

Training data

The model was trained on a curated mixture of publicly available and research-use clinical text. Training corpus distribution (approximate):

MIMIC-IV-ED triage notes (PhysioNet credentialed access): 227,916 records
MC-MED triage notes (Stanford StaRR, research use): 116,854 records
MIETIC narrative paraphrases (research): 5,399 records
ER-REASON discharge summaries (UCSF academic narrative): 18,120 records
medgemma-grounded synthetic dialect bridge: 26,344 records

Training data is not redistributed with this model. Users requiring training data must obtain it through the original distributors with appropriate data use agreements (DUAs).

License

Apache 2.0 for the model weights and configuration.

Note: training data licenses are separate and apply to the original data distributors (PhysioNet, Stanford StaRR, etc.); this model does not redistribute training data.

Citation

If you use this model, please cite the associated technical report. A preprint is forthcoming; until it is published, contact the model authors for the current citation.

Contact

Model authors: see Hugging Face profile.

For deployment-related questions, including site calibration, fairness-audit guidance, or the recommended three-layer inference stack, please contact the model authors before any clinical use.

Model tree for vadimbelsky/bert-esi-triage-v49

Base model

microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
