protonator-models
Minimal-dependency (torch + rdkit + numpy) D-MPNN model weights for
protonator. Each model is an ensemble (5-fold; tmelt_mpnn is a 10-model loss-diverse consensus);
protonator returns the mean prediction with the across-fold standard deviation as a
calibrated uncertainty. Weights are fetched automatically at a pinned revision via
huggingface_hub.
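The mean-plus-spread aggregation described above can be sketched in a few lines. This is a minimal illustration, not protonator's implementation: the function name `ensemble_summary` is hypothetical, the fold values are made up, and the choice of the sample (n − 1) standard deviation is an assumption.

```python
from statistics import mean, stdev


def ensemble_summary(fold_predictions):
    """Collapse the per-fold predictions for one input into (mean, std)."""
    if len(fold_predictions) < 2:
        raise ValueError("need at least two fold predictions to estimate a spread")
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return mean(fold_predictions), stdev(fold_predictions)


# Illustrative (made-up) outputs of a 5-fold ensemble for a single site.
fold_preds = [4.71, 4.80, 4.66, 4.90, 4.78]
mu, sigma = ensemble_summary(fold_preds)
print(f"prediction = {mu:.2f}, uncertainty = {sigma:.2f}")
```

A large spread signals that the folds disagree, which is typically the case for inputs far from the training distribution.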
| Folder | Endpoint | Accuracy |
|---|---|---|
| pka_mpnn/ | microscopic (per-site) aqueous pKa | scaffold-5CV RMSE ~1.30, random-5CV ~1.08; Enamine external RMSE 0.55 (R² 0.96) |
| logp_mpnn/ | octanol–water logP | 5-fold CV RMSE 0.77, MAE 0.50, R² 0.86 |
| logs_mpnn/ | aqueous logS (log₁₀ mol/L, ~298 K) | 5-fold CV RMSE 0.54, MAE 0.35, R² 0.92 |
| tmelt_mpnn/ | melting temperature T_m (Kelvin) | scaffold-5CV RMSE 33.0 K, MAE 25.3 K (R² 0.73); never-trained held-out RMSE 34.0 K (Tetko subset 29.5 K) |
Each folder holds fold_*.pt + config.json (per-fold output denormalization and a
featurizer-version contract validated at load); tmelt_mpnn/ ships 10 fold checkpoints
(2 loss objectives × 5 scaffold folds) plus a desc_norm.json (descriptor standardization).
pka_mpnn — per-site pKa
Microscopic (per-ionization-site) aqueous pKa for drug-like small molecules, 2D-only (SMILES / molecular graph; no 3D conformers, no QM). Given a SMILES and an ionization-center (IC) atom, predicts that site's pKa with an ensemble uncertainty.
Architecture (per fold)
- depth-3 directed-bond D-MPNN, hidden 1024
- distance-conditioned IC-centric attention readout (attention+dist): a learned shortest-path-distance bias routes any substituent, at any topological distance, to the ionization center in O(1), so a shallow model is sensitive to remote substituent effects
- inductive descriptor at the IC (Taft σ_I / Swain–Lupton σ_F / Kier–Hall E-state, with no distance cutoff)
- dropout 0.15 + weight-decay; per-fold output denormalization; 5-fold ensemble
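The distance-conditioned readout in the list above can be illustrated with a small pure-Python sketch. It is an assumption-laden toy, not the shipped layer: the function names, the dot-product scoring, and the clamped bias table are all hypothetical. What it shows is the mechanism: the IC atom attends to every atom in one step, with a learned scalar bias indexed by shortest-path distance added to each attention logit.

```python
import math
from collections import deque


def shortest_path_distances(adjacency, source):
    """Breadth-first search: bond-count distance from `source` to each reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def ic_attention_readout(atom_states, adjacency, ic_index, distance_bias):
    """Attention-pool atom states, using the IC atom as query plus a per-distance bias."""
    dist = shortest_path_distances(adjacency, ic_index)
    query = atom_states[ic_index]
    scale = math.sqrt(len(query))
    max_d = len(distance_bias) - 1
    logits = []
    for i, h in enumerate(atom_states):
        # Distances beyond the bias table (or unreachable atoms) share the last bin.
        d = min(dist.get(i, max_d), max_d)
        score = sum(q * x for q, x in zip(query, h)) / scale
        logits.append(score + distance_bias[d])
    # Numerically stable softmax over all atoms.
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    readout = [
        sum(w * h[k] for w, h in zip(weights, atom_states))
        for k in range(len(query))
    ]
    return weights, readout


# Toy 4-atom chain 0-1-2-3 with the ionization center at atom 0.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
atom_states = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
distance_bias = [0.0, -0.1, -0.2, -0.3]  # stands in for learned parameters
weights, readout = ic_attention_readout(atom_states, adjacency, 0, distance_bias)
print(weights, readout)
```

Every atom receives a non-zero weight regardless of how many bonds separate it from the IC, which is why the readout does not depend on message-passing depth to see a remote substituent.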
Benchmarks
Out-of-fold 5-fold cross-validation on a curated, residual-denoised combination of ChEMBL / iBonD / IUPAC experimental pKa (~17.6k per-site measurements):
| Split | RMSE | MAE |
|---|---|---|
| scaffold 5-fold CV | ~1.30 | ~0.95 |
| random 5-fold CV | ~1.08 | — |
Held-out external set (Enamine fluoro, 158 molecules, not used in training):
| Set | RMSE | MAE | R² |
|---|---|---|---|
| Enamine fluoro | 0.55 | 0.40 | 0.96 |
Remote-substituent sensitivity (the headline improvement over the prior PKaGIN model):
| Probe | this model | prior PKaGIN |
|---|---|---|
| Hammett ρ (para-benzoic series; target 1.00) | → 1.0 | 0.34 (95% CI [−0.00, 0.77]) |
| fluoro matched-pair Δ sign-accuracy | 1.00 | 0.33 |
The prior model is degenerate on remote substituents (it predicts near-identical pKa for a molecule and an analog whose substituent lies beyond its receptive field); this model fixes that while matching/exceeding overall accuracy.
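For readers unfamiliar with the Hammett probe, the sketch below shows one way such a ρ could be estimated: regress the pKa shift relative to the unsubstituted acid, pKa(H) − pKa(X), on the para substituent constant σ_p. The σ_p values are standard literature constants; the pKa inputs are synthetic placeholders built to illustrate the two regimes (ρ = 1 and a degenerate ρ = 0), not model outputs, and the function names are hypothetical. The exact protocol behind the table above is not specified in this card.

```python
# Standard para Hammett constants (sigma_p).
SIGMA_PARA = {"H": 0.00, "CH3": -0.17, "OCH3": -0.27, "Cl": 0.23, "CN": 0.66, "NO2": 0.78}


def least_squares_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def hammett_rho(pka_by_substituent):
    """Slope of pKa(H) - pKa(X) against sigma_p for para-substituted benzoic acids."""
    subs = sorted(pka_by_substituent)
    sigmas = [SIGMA_PARA[s] for s in subs]
    shifts = [pka_by_substituent["H"] - pka_by_substituent[s] for s in subs]
    return least_squares_slope(sigmas, shifts)


# Synthetic inputs: an ideal responder and a model blind to the substituent.
ideal = {s: 4.20 - sigma for s, sigma in SIGMA_PARA.items()}
blind = {s: 4.20 for s in SIGMA_PARA}
rho_ideal = hammett_rho(ideal)
rho_blind = hammett_rho(blind)
print(rho_ideal, rho_blind)
```

A model that predicts the same pKa for every member of the series yields a slope of zero, which is the degenerate behaviour the probe is designed to expose.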
Required input standardization
The model was trained on neutral, desalted, largest-organic-fragment SMILES that were not
tautomer-canonicalized. protonator.predict_sites applies the matching standardization
(desalt + largest-fragment + neutralize, no tautomer canonicalization) before detecting
ionization centers, so charged species and salts are handled correctly. Do not bypass it for
arbitrary user input.
logp_mpnn — octanol/water logP
D-MPNN, 5-fold ensemble. 5-fold CV: RMSE 0.77, MAE 0.50, R² 0.86.
logs_mpnn — aqueous logS
Aqueous log solubility (log₁₀ mol/L, ~298 K); shares the D-MPNN trunk with logP, trained jointly. 5-fold CV: RMSE 0.54, MAE 0.35, R² 0.92.
tmelt_mpnn — melting temperature
Melting point T_m (Kelvin) for organic small molecules, 2D-only (SMILES / molecular
graph; no 3D conformers, no crystal structure). Shares the CheMeleon-initialized D-MPNN trunk
with logP/logS (hidden 2048, depth 6, mean aggregation) and adds descriptor infusion: 11
physically-grounded melting-point descriptors (topological symmetry number, conformational
flexibility, H-bond donors/acceptors, ring & aromatic rigidity, TPSA, size) are concatenated to
the pooled graph encoding before the FFN head. Deployed as a 10-model loss-diverse consensus
(MSE + Huber objectives × 5 scaffold folds); desc_norm.json ships the descriptor
standardization applied at inference.
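Descriptor infusion as described above amounts to z-scoring the descriptors and appending them to the pooled graph vector before the FFN head. The sketch below illustrates that step only; the function names are hypothetical, the layout of desc_norm.json is not documented here (means and standard deviations are passed in directly), and three descriptors stand in for the real eleven.

```python
def standardize(descriptors, means, stds, eps=1e-8):
    """Z-score each descriptor; eps guards against a zero standard deviation."""
    return [(x - m) / max(s, eps) for x, m, s in zip(descriptors, means, stds)]


def infuse(pooled_encoding, descriptors, means, stds):
    """Concatenate standardized descriptors onto the pooled graph encoding."""
    return list(pooled_encoding) + standardize(descriptors, means, stds)


# Toy values: a 4-dim pooled encoding and 3 descriptors (illustrative only).
pooled = [0.12, -0.40, 0.88, 0.05]
raw_desc = [2.0, 10.0, 3.0]
desc_mean = [1.0, 10.0, 1.0]
desc_std = [0.5, 2.0, 4.0]
ffn_input = infuse(pooled, raw_desc, desc_mean, desc_std)
print(ffn_input)
```

Standardizing with the training-set statistics keeps the descriptor block on the same scale the FFN head saw during training.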
Data
Forensically-cleaned 243k-molecule corpus combining a patent-mined set (214k) and the
Tetko/OCHEM literature set (36k). Multi-signal label QC (cross-validated model residual +
structural-neighbor consistency + scaffold consistency), chemist review, and a non-circular
drop-validation flagged and removed 6,748 corroborated bad labels (2.7%) — °F↔°C unit errors,
boiling/decomposition temperatures recorded as melting points, free-base/salt mismatches, and a
Tetko missing-value sentinel — while protecting genuinely high-melting aromatic polyacids.
Benchmarks
| Split | RMSE (K) | MAE (K) | R² |
|---|---|---|---|
| scaffold 5-fold CV (cleaned labels) | 33.0 | 25.3 | 0.73 |
| — Tetko subset | 31.8 | — | — |
| never-trained held-out (25k) | 34.0 | 23.9 | 0.65 |
| — Tetko subset | 29.5 | — | — |
Melting point is the hardest of the common physicochemical endpoints (it depends on crystal packing, which a single-molecule 2D graph cannot encode); the experimental inter-source noise floor on this kind of broad-range data is σ ≈ 35 K. ~33–34 K RMSE on trustworthy labels is therefore at the state-of-the-art frontier and matches/edges the best published consensus models on the Tetko benchmark.
Usage
protonator fetches these automatically (pinned revision). Manual load:
from protonator.ml.models.pka_mpnn import PKaPredictor
pred = PKaPredictor(weights_dir="<pka_mpnn folder>", device="cpu")
sites = pred.predict_sites("[Na+].CC(=O)[O-]") # auto-standardized -> Carboxylic Acid ~4.96
Accuracy figures are out-of-fold cross-validation on the experimental training data plus a held-out external set; they are not directly comparable across endpoints (different data and splits).
Citation
Isayev lab, protonator — https://github.com/isayevlab/protonator
solvation_mpnn (solvation free energy, dG_solv)
solvation_mpnn/ — solute-in-solvent solvation free energy (dG_solv, kcal/mol at 298.15 K).
Dual-encoder D-MPNN: separate solute and solvent encoders (hidden 2048, depth 6, 72-dim atom
features) feeding an FFN over both pooled vectors plus per-molecule RDKit SlogP_VSA descriptors
(4120 -> 1024 -> 1024 -> 1); 5-fold ensemble. Self-contained (torch + rdkit + numpy only).
| 5-fold CV (out-of-fold, 21,214 solute/solvent pairs) |
|---|
| dG_solv RMSE 0.95 kcal/mol / MAE 0.51 kcal/mol / R² 0.978 |
Also drives octanol-water LogP and arbitrary phase log-partition coefficients via the
thermodynamic cycle (dG_a - dG_b) / (RT ln 10).

Files: ensemble_fold_0.pt..ensemble_fold_4.pt (bare state_dicts) + config.json
(informational provenance; architecture is fixed in
protonator.ml.models._common.ENCODER_CONFIG, not parsed at load).
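The thermodynamic-cycle expression quoted in this section can be evaluated as follows. This is a minimal sketch with a hypothetical function name; the phase assignment (a = water, b = octanol for octanol-water logP, so that a more favourable solvation in octanol gives a positive value) is an assumption about the sign convention, and the input free energies are illustrative numbers rather than model predictions.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def log_partition(dg_a, dg_b, temperature=298.15):
    """log10 partition coefficient between phases b and a.

    dg_a and dg_b are solvation free energies (kcal/mol) of the same solute
    in solvents a and b. Assumed convention: the result is positive when
    the solute is better solvated in b.
    """
    return (dg_a - dg_b) / (R_KCAL * temperature * math.log(10))


# Illustrative values: -5.0 kcal/mol in water, -7.0 kcal/mol in octanol.
logp = log_partition(dg_a=-5.0, dg_b=-7.0)
print(round(logp, 2))
```

At 298.15 K, RT ln 10 is about 1.364 kcal/mol, so each 1.364 kcal/mol difference in solvation free energy corresponds to one log unit of partitioning.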