SpatialWhisperer
A trimodal embedding model that aligns hematoxylin & eosin (H&E) image patches, gene-expression profiles, and free-text descriptions into a shared 512-dimensional space. It enables zero-shot cell-type annotation of H&E patches and natural-language querying over histopathology and spatial transcriptomics data.
This checkpoint (seed 0) is from the ICML 2026 paper Transitive Representation Learning Enhances Histopathology Annotation (Schaefer et al., PMLR vol. 306). The paper refers to this configuration as the Trimodal model: three encoders trained on two paired-modality datasets (transcriptome↔text and image↔transcriptome) that together span three modalities.
- Code & reproduction pipeline: https://github.com/zinagoodlab/spatialwhisperer
- Paper: https://openreview.net/forum?id=Ze7U293Zw4
Architecture
| Modality | Encoder | Status |
|---|---|---|
| Image (H&E) | UNI2 (MahmoodLab/UNI2-h) | locked |
| Transcriptome | Geneformer-12L-30M | locked |
| Text | BioBERT v1.1 | trained |
Three projection heads map each encoder's pooled features into a shared 512-dimensional space. Only the text tower and the three projection heads are trained.
Training data
Three datasets covering two modality pairings (image↔transcriptome and transcriptome↔text):
- HEST-1K — H&E ↔ spatial gene expression (Visium-style spots)
- CellxGene Census — gene expression ↔ free-text cell/sample metadata
- ARCHS4/GEO — gene expression ↔ free-text sample descriptions
Training: 4 epochs, AdamW (lr 1e-5), cosine schedule (3% warmup), batch size 512, single H100. This checkpoint is from epoch 3, global step 14624.
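To make the schedule concrete, here is a minimal sketch of a learning-rate curve with a 3% warmup followed by cosine decay, peaking at 1e-5. The linear warmup shape, the decay down to zero, and the `TOTAL_STEPS` value are assumptions for illustration; the repository's actual scheduler may differ in these details.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_frac=0.03):
    """Learning rate under linear warmup followed by cosine decay to zero.

    Assumed shape only: the model card states "cosine schedule (3% warmup)"
    but not the warmup form or the final learning-rate floor.
    """
    warmup_steps = max(1, int(round(warmup_frac * total_steps)))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length, chosen only to keep the numbers round.
TOTAL_STEPS = 10_000
for s in (0, 150, 300, 5_150, 10_000):
    print(s, f"{lr_at_step(s, TOTAL_STEPS):.2e}")
```

With a real run, `total_steps` would be the number of optimizer steps over all 4 epochs.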
What this checkpoint contains
spatialwhisperer.ckpt: Lightning state-dict (~530 MB, 236 tensors covering the trained BioBERT text tower and three projection heads) plus the hyper_parameters block. Optimizer/scheduler state is stripped.
The locked foundation-model weights are NOT included. UNI2 and Geneformer are re-instantiated at load time from their original providers. The load_spatialwhisperer_model() convenience wrapper fetches both on first call.
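If you want to check what the checkpoint holds before loading the full model, a small helper like the one below can group state-dict keys by prefix. This is a sketch: `summarize_state_dict` is a hypothetical helper, and the key names in the toy dictionary are made up for illustration, not taken from the real file.

```python
from collections import Counter


def summarize_state_dict(state_dict, depth=2):
    """Count state-dict entries per dotted key prefix of the given depth."""
    counts = Counter(".".join(key.split(".")[:depth]) for key in state_dict)
    return dict(counts)


# Toy stand-in with HYPOTHETICAL key names; the real checkpoint's keys may differ.
toy_state_dict = {
    "model.text_model.embeddings.weight": None,
    "model.text_model.encoder.layer0.weight": None,
    "model.text_projection.weight": None,
    "model.image_projection.weight": None,
    "model.transcriptome_projection.weight": None,
}
print(summarize_state_dict(toy_state_dict))

# With the real file (requires torch; only unpickle checkpoints you trust):
#   import torch
#   ckpt = torch.load("spatialwhisperer.ckpt", map_location="cpu", weights_only=False)
#   print(summarize_state_dict(ckpt["state_dict"]))
#   print(ckpt["hyper_parameters"])
```

Lightning checkpoints store weights under the `state_dict` key and saved hyperparameters under `hyper_parameters`, which is why the commented usage indexes those two entries.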
Usage
Install the code repository (pixi env), then:
```python
from spatialwhisperer import load_spatialwhisperer_model

model, tokenizer, transcriptome_processor, image_processor = load_spatialwhisperer_model()
# model: TranscriptomeTextDualEncoderLightning (frozen, eval mode, on CUDA if available)
```
First call downloads the SpatialWhisperer checkpoint plus UNI2 and Geneformer weights; subsequent calls load from cache.
Each get_<modality>_features call returns (pooled_features, projected_embed_in_shared_space). The second element is the 512-D shared-space embedding to compare across modalities.
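```python
import torch

prompts = ["cytotoxic T cells", "plasma cells"]
# Tokenize the prompts and move the tensors to the model's device.
text_inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    # The second return value is the normalized 512-D shared-space embedding.
    _, text_emb = model.model.get_text_features(normalize_embeds=True, **text_inputs)

print(text_emb.shape)  # (2, 512)
```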
Image and transcriptome embeddings follow the same pattern; see the GitHub README for complete examples covering all three modalities.
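Once patch and prompt embeddings live in the shared space, zero-shot annotation reduces to comparing them. The dependency-free sketch below scores one patch against each prompt by cosine similarity and applies a softmax. The toy 3-D vectors stand in for real 512-D embeddings, and the `temperature` parameter is an assumption (the trained model may apply its own learned logit scale).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def zero_shot_scores(patch_emb, prompt_embs, temperature=1.0):
    """Softmax over cosine similarities between one patch and each prompt."""
    patch = l2_normalize(patch_emb)
    sims = [
        sum(p * t for p, t in zip(patch, l2_normalize(prompt)))
        for prompt in prompt_embs
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


prompts = ["cytotoxic T cells", "plasma cells"]
# Toy stand-ins for shared-space embeddings (real ones are 512-D).
patch_emb = [0.9, 0.1, 0.0]
prompt_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

probs = zero_shot_scores(patch_emb, prompt_embs)
best = prompts[probs.index(max(probs))]
print(best, probs)
```

With tensors from the model, the same comparison is a single matrix product between the normalized image embeddings and the transposed text embeddings.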
Foundation-model setup
UNI2 (MahmoodLab/UNI2-h) is a gated HuggingFace model. Before first use:
- Accept the terms at https://huggingface.co/MahmoodLab/UNI2-h.
- Make a read token visible to your environment. The loader checks HF_TOKEN / HUGGINGFACE_TOKEN and otherwise falls back to huggingface_hub's on-disk cache. The simplest setup is to run huggingface-cli login and paste your read token.
Geneformer downloads without gating.
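To check ahead of time whether a token is visible through the environment, a lookup along these lines can help. It is a sketch of the behaviour described above, not the loader's actual code: `resolve_hf_token` is a hypothetical helper, and checking HF_TOKEN before HUGGINGFACE_TOKEN is an assumed precedence.

```python
import os

# Environment variables named in this model card; the order is an assumption.
TOKEN_ENV_VARS = ("HF_TOKEN", "HUGGINGFACE_TOKEN")


def resolve_hf_token(env=None):
    """Return the first non-empty token found in the environment, else None."""
    env = os.environ if env is None else env
    for name in TOKEN_ENV_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    # None means the caller would rely on huggingface_hub's cached login.
    return None


token = resolve_hf_token()
print("token found in environment" if token else "no env token; relying on cached login")
```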
Evaluation
The accompanying code repository reproduces every paper benchmark with one command: pixi run snakemake -j N paper_all. Verified end-to-end on a fresh ILC ampere4 clone (2026-05-31):
| Benchmark | Macro AUROC (this ckpt, seed 0) |
|---|---|
| PathoCell CRC (13-class) | 0.630 |
| Lizard (3-class reduced) | 0.764 |
| PanNuke (4-class reduced) | 0.689 |
| Kriegsmann Skin Conditions (16-class, clinical labels) | 0.698 |
These match the paper's reported Trimodal seed-0 numbers exactly. Comparisons to CONCH, PLIP, OmiCLIP, and other baselines are in the paper.
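For readers unfamiliar with the metric, macro AUROC averages a one-vs-rest AUROC over classes, so every class counts equally regardless of its frequency. The sketch below computes it from scratch on toy data. It illustrates the metric only; the paper's evaluation code may handle ties, missing classes, or score calibration differently.

```python
def binary_auroc(labels, scores):
    """AUROC as P(random positive outscores random negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        return None  # undefined when one of the two groups is empty
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def macro_auroc(y_true, class_scores):
    """Unweighted mean of one-vs-rest AUROCs.

    y_true: class index per sample.
    class_scores: per-sample list of per-class scores.
    """
    n_classes = len(class_scores[0])
    per_class = []
    for c in range(n_classes):
        auc = binary_auroc([y == c for y in y_true], [row[c] for row in class_scores])
        if auc is not None:  # skip classes absent from y_true
            per_class.append(auc)
    return sum(per_class) / len(per_class)


# Toy data: 6 samples, 3 classes (illustrative values, not model output).
y_true = [0, 0, 1, 1, 2, 2]
class_scores = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
    [0.2, 0.3, 0.5],
]
result = macro_auroc(y_true, class_scores)
print(result)
```

The pairwise comparison is quadratic in the number of samples, which is fine for illustration; a rank-based implementation is preferable on full benchmarks.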
Intended use & limitations
Intended: research on multimodal histopathology, zero-shot cell-type annotation, spatial transcriptomics analysis, and natural-language querying over H&E and gene-expression data.
Not intended: clinical diagnosis or treatment decisions. The model was trained on academic datasets and is not validated for clinical use.
Limitations:
- Trained on Visium-scale spots (~55 μm); finer-grained image–expression alignment is not guaranteed.
- BioBERT vocabulary constrains the text tower; rare technical terms may be out-of-distribution.
- The image tower (UNI2) is locked, so performance is likely to degrade on tissue types that are poorly represented in UNI2's pretraining data.
License
CC BY-NC 4.0 (research use). Foundation-model weights (UNI2, Geneformer, BioBERT) carry their own licenses; consult the upstream repositories.
Citation
@inproceedings{schaefer2026spatialwhisperer,
title = {Transitive Representation Learning Enhances Histopathology Annotation},
author = {Schaefer, Moritz and Piran, Zoe and Walter, Nils Philipp and Awasthi, Animesh and Bock, Christoph and Leskovec, Jure and Good, Zinaida},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
series = {Proceedings of Machine Learning Research},
volume = {306},
publisher = {PMLR},
address = {Seoul, South Korea},
month = jul,
year = {2026},
url = {https://openreview.net/forum?id=Ze7U293Zw4}
}