---
license: cc-by-nc-4.0
tags:
- biology
- histopathology
- spatial-transcriptomics
- multimodal
- pathology
- gene-expression
- biobert
- vision-language
library_name: pytorch
---

# SpatialWhisperer

SpatialWhisperer is a trimodal embedding model that aligns hematoxylin & eosin (H&E) image patches, gene-expression profiles, and free-text descriptions in a shared 512-dimensional space. It enables zero-shot cell-type annotation of H&E patches and natural-language querying over histopathology and spatial transcriptomics data.

This checkpoint (seed 0) is from the ICML 2026 paper *[Transitive Representation Learning Enhances Histopathology Annotation](https://openreview.net/forum?id=Ze7U293Zw4)* (Schaefer et al., PMLR vol. 306). The paper refers to this configuration as the **Trimodal** model: three encoders trained on two kinds of paired-modality data (transcriptome↔text and image↔transcriptome) that together span three modalities.

- **Code & reproduction pipeline:** <https://github.com/zinagoodlab/spatialwhisperer>
- **Paper:** <https://openreview.net/forum?id=Ze7U293Zw4>

## Architecture

| Modality | Encoder | Status |
|----------|---------|--------|
| Image (H&E) | UNI2 (`MahmoodLab/UNI2-h`) | locked |
| Transcriptome | Geneformer-12L-30M | locked |
| Text | BioBERT v1.1 | trained |

Three projection heads map each encoder's pooled features into a shared 512-dimensional space. Only the text tower and the three projection heads are trained.
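As a shape-level illustration of the projection step, the sketch below maps pooled features of three different widths into one 512-dimensional space and L2-normalizes the result. The input widths (1536, 512, 768), the single linear layer per head, and the random weights are assumptions made for the example; the actual head architecture is defined in the code repository.

```python
import numpy as np

SHARED_DIM = 512  # shared embedding width stated in this model card

# Hypothetical pooled-feature widths per encoder; not read from the checkpoint.
ENCODER_DIMS = {"image": 1536, "transcriptome": 512, "text": 768}

rng = np.random.default_rng(0)

# One linear head per modality (assumed form: weight matrix plus bias).
heads = {
    name: (rng.normal(scale=dim ** -0.5, size=(dim, SHARED_DIM)), np.zeros(SHARED_DIM))
    for name, dim in ENCODER_DIMS.items()
}


def project(name, pooled):
    """Project pooled features of one modality into the shared space and L2-normalize."""
    weight, bias = heads[name]
    embed = pooled @ weight + bias
    return embed / np.linalg.norm(embed, axis=-1, keepdims=True)


# Random stand-ins for a batch of 4 pooled feature vectors per modality.
embeds = {name: project(name, rng.normal(size=(4, dim))) for name, dim in ENCODER_DIMS.items()}
for name, embed in embeds.items():
    print(name, embed.shape)
```

Because every modality ends up as a unit vector of the same width, a plain dot product between any two embeddings can be read as a cosine similarity.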

## Training data

Three paired-modality datasets:

- **HEST-1K** — H&E ↔ spatial gene expression (Visium-style spots)
- **CellxGene Census** — gene expression ↔ free-text cell/sample metadata
- **ARCHS4/GEO** — gene expression ↔ free-text sample descriptions

Training: 4 epochs, AdamW (lr 1e-5), cosine schedule (3% warmup), batch size 512, single H100. This checkpoint is from epoch 3, global step 14624.
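To make the schedule concrete, here is a minimal sketch of a cosine learning-rate schedule with 3% warmup and a peak of 1e-5. The linear warmup shape, the decay to zero, the function name `lr_at`, and the demo run length are assumptions; the exact scheduler used for training is in the code repository.

```python
import math


def lr_at(step, total_steps, base_lr=1e-5, warmup_frac=0.03):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr (assumed warmup shape).
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps (assumed floor of 0).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 10_000  # hypothetical run length, chosen only for the demo
for step in (0, 150, 300, 5_150, 10_000):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```

With these demo settings the rate climbs to its peak after 300 steps, passes half the peak midway through the decay, and reaches zero at the final step.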

## What this checkpoint contains

- `spatialwhisperer.ckpt` — Lightning state-dict (~530 MB, 236 tensors: trained BioBERT text tower + three projection heads) plus the `hyper_parameters` block. Optimizer/scheduler state is stripped.

**The locked foundation-model weights are NOT included.** UNI2 and Geneformer are re-instantiated at load time from their original providers. The `load_spatialwhisperer_model()` convenience wrapper fetches both on first call.
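One way to check what a checkpoint like this contains is to group its state-dict keys by prefix. The helper below is a sketch; the name `summarize_state_dict` and the toy key names are hypothetical, and the real parameter names may differ.

```python
from collections import Counter


def summarize_state_dict(state_dict, depth=2):
    """Count state-dict entries per key prefix (first `depth` dotted components)."""
    counts = Counter(".".join(key.split(".")[:depth]) for key in state_dict)
    return dict(counts)


# Toy stand-in with hypothetical key names; values are irrelevant for counting.
toy_state_dict = {
    "model.text_model.embeddings.word_embeddings.weight": None,
    "model.text_model.encoder.layer.0.attention.self.query.weight": None,
    "model.text_projection.weight": None,
    "model.image_projection.weight": None,
    "model.transcriptome_projection.weight": None,
}
print(summarize_state_dict(toy_state_dict))
```

For the real file, assuming the standard Lightning layout with a top-level `state_dict` key, the mapping could be obtained with `torch.load("spatialwhisperer.ckpt", map_location="cpu")["state_dict"]`. Recent PyTorch versions may need `weights_only=False` to unpickle the `hyper_parameters` block, which should only be used for files you trust. If the description above holds, the counts should add up to 236.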

## Usage

Install the [code repository](https://github.com/zinagoodlab/spatialwhisperer) (pixi env), then:

```python
from spatialwhisperer import load_spatialwhisperer_model

model, tokenizer, transcriptome_processor, image_processor = load_spatialwhisperer_model()
# model: TranscriptomeTextDualEncoderLightning (frozen, eval mode, on CUDA if available)
```

The first call downloads the SpatialWhisperer checkpoint plus the UNI2 and Geneformer weights; subsequent calls load them from the local cache.

Each `get_<modality>_features` call returns `(pooled_features, projected_embed_in_shared_space)`. The second element is the 512-D shared-space embedding to compare across modalities.

```python
import torch

prompts = ["cytotoxic T cells", "plasma cells"]
text_inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    _, text_emb = model.model.get_text_features(normalize_embeds=True, **text_inputs)
print(text_emb.shape)  # (2, 512)
```

Image and transcriptome embeddings follow the same pattern; see the [GitHub README](https://github.com/zinagoodlab/spatialwhisperer#use-the-model) for complete examples covering all three modalities.
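Once text and image embeddings live in the same space, zero-shot annotation reduces to a nearest-prompt lookup by cosine similarity. The sketch below shows that step on random stand-in arrays rather than real model outputs; the helper name `zero_shot_labels` is hypothetical, and the paper's exact scoring procedure may differ.

```python
import numpy as np


def zero_shot_labels(patch_emb, prompt_emb, prompts):
    """Assign each patch the prompt with the highest cosine similarity."""
    # Re-normalize defensively so the dot product is a cosine similarity.
    patch_emb = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    prompt_emb = prompt_emb / np.linalg.norm(prompt_emb, axis=1, keepdims=True)
    similarity = patch_emb @ prompt_emb.T  # shape: (n_patches, n_prompts)
    best = similarity.argmax(axis=1)
    return [prompts[i] for i in best], similarity


prompts = ["cytotoxic T cells", "plasma cells"]
rng = np.random.default_rng(0)
prompt_emb = rng.normal(size=(2, 512))  # stand-in for text_emb
# Stand-in patch embeddings: noisy copies of prompt 1, prompt 0, prompt 1.
patch_emb = prompt_emb[[1, 0, 1]] + 0.1 * rng.normal(size=(3, 512))

labels, similarity = zero_shot_labels(patch_emb, prompt_emb, prompts)
print(labels)
```

With real outputs, the tensors returned by the `get_<modality>_features` calls can be converted with `.cpu().numpy()` before being passed in.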

## Foundation-model setup

UNI2 (`MahmoodLab/UNI2-h`) is a gated HuggingFace model. Before first use:

1. Accept the terms at <https://huggingface.co/MahmoodLab/UNI2-h>.
2. Make a read token visible to your environment: the loader checks `HF_TOKEN` / `HUGGINGFACE_TOKEN` and otherwise falls back to the token that `huggingface_hub` has stored on disk. The simplest setup is:
   ```bash
   huggingface-cli login    # paste your read token
   ```

Geneformer downloads without gating.
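Before the first load, it can help to check which token source will be picked up. The snippet below is a sketch of the environment-variable lookup described in step 2; the helper name `find_hf_token` and the precedence of `HF_TOKEN` over `HUGGINGFACE_TOKEN` are assumptions, not the loader's actual code.

```python
import os


def find_hf_token(env=None):
    """Return the first non-empty token in HF_TOKEN or HUGGINGFACE_TOKEN, else None."""
    env = os.environ if env is None else env
    for name in ("HF_TOKEN", "HUGGINGFACE_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    # None means: rely on the token saved by `huggingface-cli login`.
    return None


print(find_hf_token({"HUGGINGFACE_TOKEN": "hf_example"}))
print(find_hf_token({}))
```

Called without arguments, the helper inspects the current process environment.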

## Evaluation

The accompanying [code repository](https://github.com/zinagoodlab/spatialwhisperer) reproduces every paper benchmark with one command: `pixi run snakemake -j N paper_all`. Verified end-to-end on a fresh ILC ampere4 clone (2026-05-31):

| Benchmark | Macro AUROC (this ckpt, seed 0) |
|-----------|-----------------|
| PathoCell CRC (13-class) | 0.630 |
| Lizard (3-class reduced) | 0.764 |
| PanNuke (4-class reduced) | 0.689 |
| Kriegsmann Skin Conditions (16-class, clinical labels) | 0.698 |

These match the paper's reported Trimodal seed-0 numbers exactly. Comparisons to CONCH, PLIP, OmiCLIP, and other baselines are in the paper.
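For readers unfamiliar with the metric, macro AUROC is commonly computed as the unweighted mean of one-vs-rest AUROC values across classes. The sketch below implements that definition in plain Python on made-up scores; it assumes every class has at least one positive and one negative example, and the benchmark pipeline in the repository may compute the metric differently.

```python
def binary_auroc(scores, is_positive):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auroc(score_rows, labels, n_classes):
    """Unweighted mean of one-vs-rest AUROC over all classes."""
    per_class = []
    for c in range(n_classes):
        class_scores = [row[c] for row in score_rows]
        per_class.append(binary_auroc(class_scores, [y == c for y in labels]))
    return sum(per_class) / n_classes


# Toy similarity scores for 4 patches and 2 classes (made-up numbers).
scores = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.7, 0.5]]
labels = [0, 0, 1, 1]
print(macro_auroc(scores, labels, n_classes=2))  # 0.875
```

In the toy data, class 0 scores 0.75 and class 1 scores 1.0, which average to 0.875. Because the mean is unweighted, rare classes count as much as common ones.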

## Intended use & limitations

**Intended:** research on multimodal histopathology, zero-shot cell-type annotation, spatial transcriptomics analysis, and natural-language querying over H&E and gene-expression data.

**Not intended:** clinical diagnosis or treatment decisions. The model was trained on academic datasets and is not validated for clinical use.

**Limitations:**
- Trained on Visium-scale spots (~55 μm); finer-grained image–expression alignment is not guaranteed.
- BioBERT vocabulary constrains the text tower; rare technical terms may be out-of-distribution.
- The image tower (UNI2) is locked; performance on tissue types poorly represented in UNI2's pretraining data is likely to be lower.

## License

**CC BY-NC 4.0** (research use). Foundation-model weights (UNI2, Geneformer, BioBERT) carry their own licenses; consult the upstream repositories.

## Citation

```bibtex
@inproceedings{schaefer2026spatialwhisperer,
  title     = {Transitive Representation Learning Enhances Histopathology Annotation},
  author    = {Schaefer, Moritz and Piran, Zoe and Walter, Nils Philipp and Awasthi, Animesh and Bock, Christoph and Leskovec, Jure and Good, Zinaida},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  series    = {Proceedings of Machine Learning Research},
  volume    = {306},
  publisher = {PMLR},
  address   = {Seoul, South Korea},
  month     = jul,
  year      = {2026},
  url       = {https://openreview.net/forum?id=Ze7U293Zw4}
}
```