---
license: cc-by-nc-4.0
tags:
- biology
- histopathology
- spatial-transcriptomics
- multimodal
- pathology
- gene-expression
- biobert
- vision-language
library_name: pytorch
---

# SpatialWhisperer

A trimodal embedding model that aligns hematoxylin & eosin (H&E) image patches, gene-expression profiles, and free-text descriptions into a shared 512-dimensional space. Enables zero-shot cell-type annotation of H&E patches and natural-language querying over histopathology and spatial transcriptomics data.

This checkpoint (seed 0) is from the ICML 2026 paper *[Transitive Representation Learning Enhances Histopathology Annotation](https://openreview.net/forum?id=Ze7U293Zw4)* (Schaefer et al., PMLR vol. 306). The paper refers to this configuration as the **Trimodal** model: three encoders trained on two modality pairings (transcriptome↔text and image↔transcriptome) that together span three modalities.

- **Code & reproduction pipeline:** <https://github.com/zinagoodlab/spatialwhisperer>
- **Paper:** <https://openreview.net/forum?id=Ze7U293Zw4>

## Architecture

| Modality | Encoder | Status |
|----------|---------|--------|
| Image (H&E) | UNI2 (`MahmoodLab/UNI2-h`) | locked |
| Transcriptome | Geneformer-12L-30M | locked |
| Text | BioBERT v1.1 | trained |

Three projection heads map each encoder's pooled features into a shared 512-dimensional space. Only the text tower and the three projection heads are trained.
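
To illustrate how encoder outputs of different widths become comparable, here is a toy sketch in plain Python. It assumes bias-free linear heads followed by L2 normalization; the card does not specify the head architecture, and the input widths below are placeholders rather than the real encoder sizes.

```python
import math
import random

SHARED_DIM = 512  # shared-space width stated in this card

def make_linear_head(in_dim, out_dim, rng):
    """Hypothetical bias-free linear head: an out_dim x in_dim weight matrix."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]

def project(features, weight):
    """Apply the head, then L2-normalize so dot products are cosine similarities."""
    out = [sum(w * x for w, x in zip(row, features)) for row in weight]
    norm = math.sqrt(sum(v * v for v in out)) or 1.0
    return [v / norm for v in out]

rng = random.Random(0)
# Placeholder pooled-feature widths (assumption; not the real encoder sizes).
pooled_dims = {"image": 48, "transcriptome": 32, "text": 24}
heads = {name: make_linear_head(dim, SHARED_DIM, rng) for name, dim in pooled_dims.items()}
embeds = {
    name: project([rng.gauss(0.0, 1.0) for _ in range(dim)], heads[name])
    for name, dim in pooled_dims.items()
}

# Every modality now lives in the same 512-D space, so any pair can be compared.
image_text_similarity = sum(a * b for a, b in zip(embeds["image"], embeds["text"]))
```

With random weights the similarity is meaningless; training is what makes matching image, transcriptome, and text inputs land close together.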

## Training data

Three paired-modality datasets:

- **HEST-1K** — H&E ↔ spatial gene expression (Visium-style spots)
- **CellxGene Census** — gene expression ↔ free-text cell/sample metadata
- **ARCHS4/GEO** — gene expression ↔ free-text sample descriptions

Training ran for 4 epochs with AdamW (learning rate 1e-5), a cosine schedule with 3% warmup, and batch size 512 on a single H100. This checkpoint is from epoch 3, global step 14624.
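
As a rough illustration of the learning-rate schedule, the sketch below computes the rate at a given step. It assumes linear warmup over the first 3% of steps and cosine decay to zero afterwards; the exact warmup shape, the final learning rate, and the total step count (`TOTAL_STEPS` is hypothetical) are not stated in this card.

```python
import math

def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_frac=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = max(1, int(round(warmup_frac * total_steps)))
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

TOTAL_STEPS = 20_000  # hypothetical; the run's total step count is not given here
for step in (0, 599, 600, 10_000, 19_999):
    print(step, f"{lr_at_step(step, TOTAL_STEPS):.3e}")
```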

## What this checkpoint contains

- `spatialwhisperer.ckpt` — Lightning state-dict (~530 MB, 236 tensors: trained BioBERT text tower + three projection heads) plus the `hyper_parameters` block. Optimizer/scheduler state is stripped.

**The locked foundation-model weights are NOT included.** UNI2 and Geneformer are re-instantiated at load time from their original providers. The `load_spatialwhisperer_model()` convenience wrapper fetches both on first call.
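
If you want to inspect the file yourself, a sketch along the following lines may help. Lightning checkpoints keep their weights under the `state_dict` key; `load_checkpoint` and `summarize_state_dict` are hypothetical helpers written for this example, and the parameter names in the toy dictionary are made up, so the real key names may differ.

```python
from collections import Counter

def load_checkpoint(path):
    """Load a Lightning checkpoint on CPU (requires PyTorch; trusted files only)."""
    import torch  # lazy import so the helper below also works without PyTorch
    # weights_only=False because the hyper_parameters block may hold non-tensor objects.
    return torch.load(path, map_location="cpu", weights_only=False)

def summarize_state_dict(state_dict, depth=2):
    """Count entries per dotted-name prefix of the given depth."""
    return dict(Counter(".".join(name.split(".")[:depth]) for name in state_dict))

# Toy stand-in with made-up parameter names; the real key names may differ.
toy_ckpt = {
    "state_dict": {
        "model.text_model.embeddings.weight": None,
        "model.text_model.encoder.layer0.weight": None,
        "model.text_projection.weight": None,
        "model.image_projection.weight": None,
        "model.transcriptome_projection.weight": None,
    },
    "hyper_parameters": {},
}
summary = summarize_state_dict(toy_ckpt["state_dict"])
print(summary)
```

On the real file, `ckpt = load_checkpoint("spatialwhisperer.ckpt")` followed by `len(ckpt["state_dict"])` should agree with the tensor count given above, and no UNI2 or Geneformer weights should appear among the keys.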

## Usage

Install the [code repository](https://github.com/zinagoodlab/spatialwhisperer) (pixi env), then:

```python
from spatialwhisperer import load_spatialwhisperer_model

model, tokenizer, transcriptome_processor, image_processor = load_spatialwhisperer_model()
# model: TranscriptomeTextDualEncoderLightning (frozen, eval mode, on CUDA if available)
```

First call downloads the SpatialWhisperer checkpoint plus UNI2 and Geneformer weights; subsequent calls load from cache.

Each `get_<modality>_features` call returns `(pooled_features, projected_embed_in_shared_space)`. The second element is the 512-D shared-space embedding to compare across modalities.

```python
import torch

prompts = ["cytotoxic T cells", "plasma cells"]
text_inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    _, text_emb = model.model.get_text_features(normalize_embeds=True, **text_inputs)
print(text_emb.shape)  # (2, 512)
```

Image and transcriptome embeddings follow the same pattern; see the [GitHub README](https://github.com/zinagoodlab/spatialwhisperer#use-the-model) for complete examples covering all three modalities.
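
Once you have shared-space embeddings, zero-shot annotation reduces to comparing one patch embedding against the prompt embeddings. The sketch below shows that step in plain Python on toy 3-D vectors; `zero_shot_annotate` is a hypothetical helper and the `temperature` value is an assumption, not something taken from the repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_annotate(patch_emb, prompt_embs, prompts, temperature=0.07):
    """Return the best-matching prompt and a softmax over prompt similarities."""
    logits = [cosine(patch_emb, p) / temperature for p in prompt_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(prompts)), key=probs.__getitem__)
    return prompts[best], dict(zip(prompts, probs))

prompts = ["cytotoxic T cells", "plasma cells"]
# Toy stand-ins; real embeddings are 512-D outputs of the get_*_features calls.
prompt_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.2]]
patch_emb = [0.8, 0.3, 0.1]

label, probs = zero_shot_annotate(patch_emb, prompt_embs, prompts)
print(label, probs)
```

With PyTorch tensors that are already normalized (`normalize_embeds=True`), the same computation is a single matrix product followed by a softmax. The temperature only sharpens the probabilities and does not change which prompt ranks first.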

## Foundation-model setup

UNI2 (`MahmoodLab/UNI2-h`) is a gated HuggingFace model. Before first use:

1. Accept the terms at <https://huggingface.co/MahmoodLab/UNI2-h>.
2. Make a read token visible to your environment. The loader checks `HF_TOKEN` / `HUGGINGFACE_TOKEN` and otherwise falls back to the token that `huggingface_hub` has stored on disk. The simplest setup is:
   ```bash
   huggingface-cli login    # paste your read token
   ```

Geneformer downloads without gating.
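
To check which token source your environment would provide, a lookup like the one below can be useful. This is only a sketch of the behavior described above: `resolve_hf_token` is a hypothetical helper, and the order in which the two variables are checked is an assumption rather than the loader's confirmed logic.

```python
import os

def resolve_hf_token(env=None):
    """Return the first non-empty token from the environment, or None."""
    env = os.environ if env is None else env
    for name in ("HF_TOKEN", "HUGGINGFACE_TOKEN"):  # assumed precedence
        token = (env.get(name) or "").strip()
        if token:
            return token
    # None means: rely on the token saved by `huggingface-cli login`.
    return None

source = "environment variable" if resolve_hf_token() else "saved login (if any)"
print("Token source:", source)
```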

## Evaluation

The accompanying [code repository](https://github.com/zinagoodlab/spatialwhisperer) reproduces every paper benchmark with one command: `pixi run snakemake -j N paper_all`. Verified end-to-end on a fresh ILC ampere4 clone (2026-05-31):

| Benchmark | Macro AUROC (this ckpt, seed 0) |
|-----------|-----------------|
| PathoCell CRC (13-class) | 0.630 |
| Lizard (3-class reduced) | 0.764 |
| PanNuke (4-class reduced) | 0.689 |
| Kriegsmann Skin Conditions (16-class, clinical labels) | 0.698 |

These match the paper's reported Trimodal seed-0 numbers exactly. Comparisons to CONCH, PLIP, OmiCLIP, and other baselines are in the paper.
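
For readers unfamiliar with the metric, the sketch below computes a macro AUROC in plain Python. It assumes the usual definition, the unweighted mean of one-vs-rest AUROCs over classes, and skips classes with no positives or no negatives; the repository's evaluation code may handle such edge cases differently. The scores and labels are toy values, not benchmark data.

```python
def binary_auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Returns None if a class is missing.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auroc(score_rows, true_classes, n_classes):
    """Unweighted mean of one-vs-rest AUROCs over the classes that can be scored."""
    per_class = []
    for c in range(n_classes):
        auc = binary_auroc([row[c] for row in score_rows], [t == c for t in true_classes])
        if auc is not None:
            per_class.append(auc)
    return sum(per_class) / len(per_class)

# Toy similarity scores: one row per patch, one column per class prompt.
score_rows = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.1, 0.4],
]
true_classes = [0, 0, 1, 2, 2]
toy_macro = macro_auroc(score_rows, true_classes, n_classes=3)
print(f"{toy_macro:.3f}")
```

Because AUROC depends only on how patches are ranked within each class column, it is unaffected by any monotonic rescaling of the similarity scores.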

## Intended use & limitations

**Intended:** research on multimodal histopathology, zero-shot cell-type annotation, spatial transcriptomics analysis, and natural-language querying over H&E and gene-expression data.

**Not intended:** clinical diagnosis or treatment decisions. The model was trained on academic datasets and is not validated for clinical use.

**Limitations:**
- Trained on Visium-scale spots (~55 μm); finer-grained image–expression alignment is not guaranteed.
- BioBERT vocabulary constrains the text tower; rare technical terms may be out-of-distribution.
- The image tower (UNI2) is locked; performance on tissue types poorly represented in UNI2's pretraining data is likely to be lower.

## License

**CC BY-NC 4.0** (research use). Foundation-model weights (UNI2, Geneformer, BioBERT) carry their own licenses; consult the upstream repositories.

## Citation

```bibtex
@inproceedings{schaefer2026spatialwhisperer,
  title     = {Transitive Representation Learning Enhances Histopathology Annotation},
  author    = {Schaefer, Moritz and Piran, Zoe and Walter, Nils Philipp and Awasthi, Animesh and Bock, Christoph and Leskovec, Jure and Good, Zinaida},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  series    = {Proceedings of Machine Learning Research},
  volume    = {306},
  publisher = {PMLR},
  address   = {Seoul, South Korea},
  month     = jul,
  year      = {2026},
  url       = {https://openreview.net/forum?id=Ze7U293Zw4}
}
```