SpatialWhisperer

SpatialWhisperer is a trimodal embedding model that aligns hematoxylin & eosin (H&E) image patches, gene-expression profiles, and free-text descriptions into a shared 512-dimensional space. It enables zero-shot cell-type annotation of H&E patches and natural-language querying over histopathology and spatial transcriptomics data.

This checkpoint (seed 0) is from the ICML 2026 paper "Transitive Representation Learning Enhances Histopathology Annotation" (Schaefer et al., PMLR vol. 306). The paper refers to this configuration as the Trimodal model: three encoders trained on two kinds of paired-modality data (transcriptome↔text and image↔transcriptome) that together span three modalities.

Architecture

| Modality      | Encoder                  | Status  |
|---------------|--------------------------|---------|
| Image (H&E)   | UNI2 (MahmoodLab/UNI2-h) | locked  |
| Transcriptome | Geneformer-12L-30M       | locked  |
| Text          | BioBERT v1.1             | trained |

Three projection heads map each encoder's pooled features into a shared 512-dimensional space. Only the text tower and the three projection heads are trained.
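As a rough illustration of the projection step described above, the following PyTorch sketch maps pooled features of different widths into one 512-dimensional space. Only the 512-D target comes from this card. The single bias-free linear layer, the L2 normalization inside the head, and the input widths (1536 for UNI2-h, 512 for Geneformer-12L-30M, 768 for BioBERT) are assumptions made for illustration, and `ProjectionHead` is a hypothetical name rather than a class from the repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SHARED_DIM = 512  # stated in this model card

# Assumed pooled-feature widths of the three encoders (not stated in this card).
ASSUMED_ENCODER_DIMS = {"image": 1536, "transcriptome": 512, "text": 768}


class ProjectionHead(nn.Module):
    """Hypothetical head: one linear map followed by L2 normalization."""

    def __init__(self, in_dim: int, out_dim: int = SHARED_DIM):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # Unit-length outputs make dot products equal to cosine similarities.
        return F.normalize(self.proj(pooled), dim=-1)


torch.manual_seed(0)
heads = nn.ModuleDict(
    {name: ProjectionHead(dim) for name, dim in ASSUMED_ENCODER_DIMS.items()}
)

# Random stand-ins for pooled encoder outputs (batch of 4 per modality).
pooled = {name: torch.randn(4, dim) for name, dim in ASSUMED_ENCODER_DIMS.items()}

with torch.no_grad():
    shared = {name: heads[name](x) for name, x in pooled.items()}

for name, emb in shared.items():
    print(name, tuple(emb.shape))
```

Because every modality ends up in the same space with unit norm, any two embeddings can be compared with a plain dot product regardless of which encoder produced them.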

Training data

Three paired-modality datasets:

  • HEST-1K — H&E ↔ spatial gene expression (Visium-style spots)
  • CellxGene Census — gene expression ↔ free-text cell/sample metadata
  • ARCHS4/GEO — gene expression ↔ free-text sample descriptions

Training: 4 epochs, AdamW (lr 1e-5), cosine schedule (3% warmup), batch size 512, single H100. This checkpoint is from epoch 3, global step 14624.
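To make the schedule concrete, here is a minimal sketch of a cosine learning-rate schedule with a 3% warmup and a peak of 1e-5. The linear warmup shape, the decay to zero, and the total step count used in the demo are assumptions; the card only states the peak learning rate, the warmup fraction, and that the schedule is cosine. `lr_at_step` is a hypothetical helper, not a function from the repository.

```python
import math


def lr_at_step(step: int, total_steps: int,
               base_lr: float = 1e-5, warmup_frac: float = 0.03) -> float:
    """Learning rate at a given optimizer step (assumed linear warmup, cosine decay to 0)."""
    warmup_steps = max(1, int(round(warmup_frac * total_steps)))
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical total; the actual number of training steps is not given in this card.
TOTAL_STEPS = 20_000
for s in (0, 300, 600, 10_000, 20_000):
    print(s, f"{lr_at_step(s, TOTAL_STEPS):.2e}")
```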

What this checkpoint contains

  • spatialwhisperer.ckpt — Lightning state-dict (~530 MB, 236 tensors: trained BioBERT text tower + three projection heads) plus the hyper_parameters block. Optimizer/scheduler state is stripped.

The locked foundation-model weights are NOT included. UNI2 and Geneformer are re-instantiated at load time from their original providers. The load_spatialwhisperer_model() convenience wrapper fetches both on first call.
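If you want to check the contents described above yourself, a small helper along these lines can summarize a loaded checkpoint. It relies only on the standard Lightning checkpoint layout (top-level `state_dict`, `hyper_parameters`, and, when present, `optimizer_states` keys). The helper name `summarize_checkpoint` and the local file path in the usage comment are assumptions.

```python
from typing import Any, Mapping


def summarize_checkpoint(ckpt: Mapping[str, Any]) -> dict:
    """Summarize an already-loaded Lightning-style checkpoint dictionary."""
    state_dict = ckpt["state_dict"]
    # Works for any tensor-like values exposing numel() and element_size().
    n_bytes = sum(t.numel() * t.element_size() for t in state_dict.values())
    return {
        "num_tensors": len(state_dict),
        "size_mb": n_bytes / 1e6,
        "hyper_parameter_keys": sorted(ckpt.get("hyper_parameters", {})),
        "has_optimizer_state": "optimizer_states" in ckpt,
    }


# Typical use (needs torch and a local copy of the file; the path is an assumption):
#   import torch
#   ckpt = torch.load("spatialwhisperer.ckpt", map_location="cpu", weights_only=False)
#   print(summarize_checkpoint(ckpt))
```

Loading with `weights_only=False` unpickles arbitrary Python objects, so it should only be used on checkpoints from a source you trust.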

Usage

Install the code repository (pixi env), then:

from spatialwhisperer import load_spatialwhisperer_model

model, tokenizer, transcriptome_processor, image_processor = load_spatialwhisperer_model()
# model: TranscriptomeTextDualEncoderLightning (frozen, eval mode, on CUDA if available)

First call downloads the SpatialWhisperer checkpoint plus UNI2 and Geneformer weights; subsequent calls load from cache.

Each get_<modality>_features call returns (pooled_features, projected_embed_in_shared_space). The second element is the 512-D shared-space embedding to compare across modalities.

import torch

prompts = ["cytotoxic T cells", "plasma cells"]
text_inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    _, text_emb = model.model.get_text_features(normalize_embeds=True, **text_inputs)
print(text_emb.shape)  # (2, 512)

Image and transcriptome embeddings follow the same pattern; see the GitHub README for complete examples covering all three modalities.
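Once shared-space embeddings are available, zero-shot annotation reduces to a nearest-prompt lookup by cosine similarity. The sketch below shows that step with NumPy on random stand-in arrays, so it runs without the model. `zero_shot_annotate` is a hypothetical helper, and in practice `image_emb` and `text_emb` would be the second return value of the corresponding get_<modality>_features calls, moved to CPU.

```python
import numpy as np


def zero_shot_annotate(image_emb, text_emb, labels):
    """Assign each image embedding the label of its most similar text prompt."""
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    # Re-normalize defensively so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = image_emb @ text_emb.T          # shape: (n_images, n_prompts)
    best = sims.argmax(axis=1)
    return [labels[i] for i in best], sims


# Stand-in data: three "patches" built as noisy copies of the prompt embeddings.
rng = np.random.default_rng(0)
prompts = ["cytotoxic T cells", "plasma cells"]
text_emb = rng.normal(size=(2, 512))
image_emb = text_emb[[1, 0, 1]] + 0.1 * rng.normal(size=(3, 512))

predicted, sims = zero_shot_annotate(image_emb, text_emb, prompts)
print(predicted)
print(sims.shape)
```

The similarity matrix is returned alongside the labels so that the per-class scores can be reused, for example for threshold-free metrics such as AUROC.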

Foundation-model setup

UNI2 (MahmoodLab/UNI2-h) is a gated HuggingFace model. Before first use:

  1. Accept the terms at https://huggingface.co/MahmoodLab/UNI2-h.
  2. Make a read token visible to your environment. The loader checks the HF_TOKEN and HUGGINGFACE_TOKEN environment variables and otherwise falls back to the token that huggingface_hub has cached on disk. The simplest setup is:
    huggingface-cli login    # paste your read token

Geneformer downloads without gating.
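Before the first load, it can help to check which of the two environment variables, if any, is set. The snippet below is a small pre-flight check, not the loader's actual logic: the function name `find_token_env_var` and the order in which the variables are tried are assumptions.

```python
import os
from typing import Mapping, Optional

# Variable names taken from this card; the lookup order is an assumption.
TOKEN_ENV_VARS = ("HF_TOKEN", "HUGGINGFACE_TOKEN")


def find_token_env_var(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the first non-empty token variable, or None."""
    env = os.environ if env is None else env
    for name in TOKEN_ENV_VARS:
        if env.get(name):
            return name
    return None


source = find_token_env_var()
if source is None:
    print("No token variable set; a token saved by `huggingface-cli login` would be needed.")
else:
    print(f"A token is available via {source}.")
```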

Evaluation

The accompanying code repository reproduces every paper benchmark with one command: pixi run snakemake -j N paper_all. Verified end-to-end on a fresh ILC ampere4 clone (2026-05-31):

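| Benchmark                                              | Macro AUROC (this ckpt, seed 0) |
|--------------------------------------------------------|---------------------------------|
| PathoCell CRC (13-class)                               | 0.630                           |
| Lizard (3-class reduced)                               | 0.764                           |
| PanNuke (4-class reduced)                              | 0.689                           |
| Kriegsmann Skin Conditions (16-class, clinical labels) | 0.698                           |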

These match the paper's reported Trimodal seed-0 numbers exactly. Comparisons to CONCH, PLIP, OmiCLIP, and other baselines are in the paper.
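For readers unfamiliar with the metric, the sketch below computes a macro AUROC in the common one-vs-rest sense: a binary AUROC per class from that class's score column, then an unweighted mean over classes. This is an assumption about the definition; the exact evaluation protocol is defined by the paper and the repository's pipeline, not by this snippet. Both function names are hypothetical.

```python
import numpy as np


def binary_auroc(scores, is_positive) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    pos, neg = scores[is_positive], scores[~is_positive]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    diff = pos[:, None] - neg[None, :]     # all positive-negative score pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def macro_auroc(score_matrix, labels) -> float:
    """Unweighted mean of one-vs-rest AUROCs; column c holds the scores for class c."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    labels = np.asarray(labels)
    per_class = [
        binary_auroc(score_matrix[:, c], labels == c)
        for c in range(score_matrix.shape[1])
    ]
    return float(np.mean(per_class))


# Toy example: four samples, two classes, similarity scores as columns.
toy_scores = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
toy_labels = [0, 0, 1, 1]
print(macro_auroc(toy_scores, toy_labels))
```

The pairwise formulation is quadratic in the number of samples, which is fine for a sanity check but slow for large benchmarks; a rank-based implementation scales better.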

Intended use & limitations

Intended: research on multimodal histopathology, zero-shot cell-type annotation, spatial transcriptomics analysis, and natural-language querying over H&E and gene-expression data.

Not intended: clinical diagnosis or treatment decisions. The model was trained on academic datasets and is not validated for clinical use.

Limitations:

  • Trained on Visium-scale spots (~55 μm); finer-grained image–expression alignment is not guaranteed.
  • BioBERT vocabulary constrains the text tower; rare technical terms may be out-of-distribution.
  • The image tower (UNI2) is locked, so performance is likely to degrade on tissue types that are poorly represented in UNI2's pretraining data.

License

CC BY-NC 4.0 (research use). Foundation-model weights (UNI2, Geneformer, BioBERT) carry their own licenses; consult the upstream repositories.

Citation

@inproceedings{schaefer2026spatialwhisperer,
  title     = {Transitive Representation Learning Enhances Histopathology Annotation},
  author    = {Schaefer, Moritz and Piran, Zoe and Walter, Nils Philipp and Awasthi, Animesh and Bock, Christoph and Leskovec, Jure and Good, Zinaida},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  series    = {Proceedings of Machine Learning Research},
  volume    = {306},
  publisher = {PMLR},
  address   = {Seoul, South Korea},
  month     = jul,
  year      = {2026},
  url       = {https://openreview.net/forum?id=Ze7U293Zw4}
}