---
license: cc-by-nc-4.0
tags:
- biology
- histopathology
- spatial-transcriptomics
- multimodal
- pathology
- gene-expression
- biobert
- vision-language
library_name: pytorch
---
# SpatialWhisperer
SpatialWhisperer is a trimodal embedding model that aligns hematoxylin & eosin (H&E) image patches, gene-expression profiles, and free-text descriptions in a shared 512-dimensional space. It enables zero-shot cell-type annotation of H&E patches and natural-language querying over histopathology and spatial transcriptomics data.
This checkpoint (seed 0) is from the ICML 2026 paper *[Transitive Representation Learning Enhances Histopathology Annotation](https://openreview.net/forum?id=Ze7U293Zw4)* (Schaefer et al., PMLR vol. 306). The paper refers to this configuration as the **Trimodal** model: three encoders trained on two paired-modality datasets (transcriptome↔text and image↔transcriptome) that together span three modalities.
- **Code & reproduction pipeline:** <https://github.com/zinagoodlab/spatialwhisperer>
- **Paper:** <https://openreview.net/forum?id=Ze7U293Zw4>
## Architecture
| Modality | Encoder | Status |
|----------|---------|--------|
| Image (H&E) | UNI2 (`MahmoodLab/UNI2-h`) | locked |
| Transcriptome | Geneformer-12L-30M | locked |
| Text | BioBERT v1.1 | trained |
Three projection heads map each encoder's pooled features into a shared 512-dimensional space. Only the text tower and the three projection heads are trained.
## Training data
Three datasets covering two modality pairings (image↔transcriptome and transcriptome↔text):
- **HEST-1K** — H&E ↔ spatial gene expression (Visium-style spots)
- **CellxGene Census** — gene expression ↔ free-text cell/sample metadata
- **ARCHS4/GEO** — gene expression ↔ free-text sample descriptions
Training: 4 epochs, AdamW (lr 1e-5), cosine schedule (3% warmup), batch size 512, single H100. This checkpoint is from epoch 3, global step 14624.
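The schedule above can be sketched in a few lines. This is a minimal illustration, assuming linear warmup over the first 3% of steps followed by cosine decay to zero. The function name `lr_at` and the `total_steps` value in the example are hypothetical, and the exact schedule shape used in the repository may differ.

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float = 1e-5,
          warmup_frac: float = 0.03) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero.

    Assumption: warmup is linear and the decay floor is 0; the card only
    states "cosine schedule (3% warmup)".
    """
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Illustrative run length only; the card does not state the total step count.
total = 10_000
for s in (0, 150, 300, 5_150, 10_000):
    print(s, lr_at(s, total))
```

The peak rate of 1e-5 is reached at the end of warmup and then decays smoothly over the remaining steps.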
## What this checkpoint contains
- `spatialwhisperer.ckpt` — Lightning state-dict (~530 MB, 236 tensors: trained BioBERT text tower + three projection heads) plus the `hyper_parameters` block. Optimizer/scheduler state is stripped.
**The locked foundation-model weights are NOT included.** UNI2 and Geneformer are re-instantiated at load time from their original providers. The `load_spatialwhisperer_model()` convenience wrapper fetches both on first call.
## Usage
Install the [code repository](https://github.com/zinagoodlab/spatialwhisperer) (pixi env), then:
```python
from spatialwhisperer import load_spatialwhisperer_model
model, tokenizer, transcriptome_processor, image_processor = load_spatialwhisperer_model()
# model: TranscriptomeTextDualEncoderLightning (frozen, eval mode, on CUDA if available)
```
First call downloads the SpatialWhisperer checkpoint plus UNI2 and Geneformer weights; subsequent calls load from cache.
Each `get_<modality>_features` call returns `(pooled_features, projected_embed_in_shared_space)`. The second element is the 512-D shared-space embedding to compare across modalities.
```python
import torch
prompts = ["cytotoxic T cells", "plasma cells"]
text_inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    _, text_emb = model.model.get_text_features(normalize_embeds=True, **text_inputs)
print(text_emb.shape) # (2, 512)
```
Image and transcriptome embeddings follow the same pattern; see the [GitHub README](https://github.com/zinagoodlab/spatialwhisperer#use-the-model) for complete examples covering all three modalities.
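Once embeddings from two modalities are in the shared space, zero-shot annotation reduces to picking the prompt whose embedding is most similar to each patch embedding. The sketch below shows that step with plain Python lists and toy 3-D vectors standing in for real 512-D embeddings. The helper names `cosine` and `zero_shot_labels` are hypothetical and are not part of the repository's API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_labels(patch_embs, prompt_embs, prompts):
    """Assign each patch the prompt with the highest cosine similarity."""
    labels = []
    for patch in patch_embs:
        scores = [cosine(patch, p) for p in prompt_embs]
        best = max(range(len(prompts)), key=scores.__getitem__)
        labels.append(prompts[best])
    return labels


# Toy stand-ins for shared-space embeddings (real ones are 512-D).
prompts = ["cytotoxic T cells", "plasma cells"]
prompt_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
patch_embs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
print(zero_shot_labels(patch_embs, prompt_embs, prompts))
```

With tensors that are already L2-normalized (as with `normalize_embeds=True`), the same ranking comes from a single matrix product of patch embeddings against the transposed prompt embeddings.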
## Foundation-model setup
UNI2 (`MahmoodLab/UNI2-h`) is a gated HuggingFace model. Before first use:
1. Accept the terms at <https://huggingface.co/MahmoodLab/UNI2-h>.
2. Make a read token visible to your environment — the loader checks `HF_TOKEN` / `HUGGINGFACE_TOKEN`, otherwise falls back to `huggingface_hub`'s on-disk cache. The simplest setup is:
```bash
huggingface-cli login # paste your read token
```
Geneformer downloads without gating.
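To confirm that a token is visible before the first (large) download, a small pre-flight check can help. This is a sketch, not the loader's actual code: the helper name `find_hf_token_var` is hypothetical, and treating `HF_TOKEN` as taking precedence over `HUGGINGFACE_TOKEN` is an assumption based only on the order in which they are listed above.

```python
import os


def find_hf_token_var(env=None):
    """Return the name of the first token variable that is set and non-empty.

    Returns None if neither variable is set, in which case a token stored by
    `huggingface-cli login` may still be picked up from disk.
    """
    env = os.environ if env is None else env
    for name in ("HF_TOKEN", "HUGGINGFACE_TOKEN"):  # assumed precedence order
        if env.get(name, "").strip():
            return name
    return None


found = find_hf_token_var()
if found is None:
    print("No token variable set; relying on the huggingface-cli login cache.")
else:
    print(f"Using token from {found}.")
```

The check only reports which variable is set and never prints the token itself.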
## Evaluation
The accompanying [code repository](https://github.com/zinagoodlab/spatialwhisperer) reproduces every paper benchmark with one command: `pixi run snakemake -j N paper_all`. Verified end-to-end on a fresh ILC ampere4 clone (2026-05-31):
| Benchmark | Macro AUROC (this ckpt, seed 0) |
|-----------|-----------------|
| PathoCell CRC (13-class) | 0.630 |
| Lizard (3-class reduced) | 0.764 |
| PanNuke (4-class reduced) | 0.689 |
| Kriegsmann Skin Conditions (16-class, clinical labels) | 0.698 |
These match the paper's reported Trimodal seed-0 numbers exactly. Comparisons to CONCH, PLIP, OmiCLIP, and other baselines are in the paper.
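For readers unfamiliar with the metric, macro AUROC is usually computed as the unweighted mean of one-vs-rest AUROCs, one per class. The sketch below follows that common definition in plain Python; whether the paper's pipeline uses exactly this averaging is an assumption, and the function names and toy scores are illustrative only.

```python
def binary_auroc(scores, is_positive):
    """AUROC as P(pos score > neg score), counting ties as half a win."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auroc(score_matrix, labels, classes):
    """Unweighted mean of one-vs-rest AUROCs.

    score_matrix[i][k] is the score of sample i for classes[k];
    labels[i] is the true class of sample i.
    """
    per_class = []
    for k, cls in enumerate(classes):
        column = [row[k] for row in score_matrix]
        per_class.append(binary_auroc(column, [y == cls for y in labels]))
    return sum(per_class) / len(per_class)


# Toy example: four samples, two classes (not real benchmark data).
classes = ["a", "b"]
labels = ["a", "a", "b", "b"]
scores = [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4], [0.2, 0.8]]
print(macro_auroc(scores, labels, classes))  # 0.75
```

Because every class contributes equally to the mean, rare classes weigh as much as common ones.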
## Intended use & limitations
**Intended:** research on multimodal histopathology, zero-shot cell-type annotation, spatial transcriptomics analysis, and natural-language querying over H&E and gene-expression data.
**Not intended:** clinical diagnosis or treatment decisions. The model was trained on academic datasets and is not validated for clinical use.
**Limitations:**
- Trained on Visium-scale spots (~55 μm); finer-grained image–expression alignment is not guaranteed.
- BioBERT vocabulary constrains the text tower; rare technical terms may be out-of-distribution.
- The image tower (UNI2) is locked; performance on tissue types poorly represented in UNI2's pretraining will be lower.
## License
**CC BY-NC 4.0** (non-commercial; intended for research use). Foundation-model weights (UNI2, Geneformer, BioBERT) carry their own licenses; consult the upstream repositories.
## Citation
```bibtex
@inproceedings{schaefer2026spatialwhisperer,
title = {Transitive Representation Learning Enhances Histopathology Annotation},
author = {Schaefer, Moritz and Piran, Zoe and Walter, Nils Philipp and Awasthi, Animesh and Bock, Christoph and Leskovec, Jure and Good, Zinaida},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
series = {Proceedings of Machine Learning Research},
volume = {306},
publisher = {PMLR},
address = {Seoul, South Korea},
month = jul,
year = {2026},
url = {https://openreview.net/forum?id=Ze7U293Zw4}
}
```