| --- |
| license: mit |
| tags: |
| - rna-seq |
| - bulk-rna |
| - cancer |
| - transcriptomics |
| - graph-neural-network |
| - transformer |
| - performer |
| - gcn |
| - pytorch |
| model_size: 48M |
| pipeline_tag: feature-extraction |
| library_name: pytorch |
| --- |
| |
| # 🧬 CancerTranscriptome-Mini-48M |
| *A compact, cancer-focused BulkFormer-style encoder for bulk RNA-seq* |
|
|
| **CancerTranscriptome-Mini-48M** is a lightweight derivative of **BulkFormer**, designed to learn cancer-specific transcriptomic structure from large-scale bulk RNA-seq. |
| It combines **GCN-based gene graph propagation**, **Rotary Expression Embeddings (REE)**, **local bin-wise Performer attention**, and **global Performer attention** into a single unified encoder. |
|
|
| This model is a proof-of-concept designed for research, experimentation, and rapid iteration on BulkFormer-style architectures applied to cancer transcriptomes. |
|
|
| --- |
|
|
| ## 🔬 Origin & References |
|
|
| ### **Primary Reference (BulkFormer)** |
| Boming Kang, Rui Fan, Meizheng Yi, Chunmei Cui, Qinghua Cui. |
| **“A large-scale foundation model for bulk transcriptomes.”** |
| bioRxiv (2025). |
| doi: https://doi.org/10.1101/2025.06.11.659222 |
|
|
| ### **This Model (CancerTranscriptome-Mini-48M)** |
| A compact re-implementation based on the BulkFormer architecture, adapted for cancer-only bulk RNA-seq and simplified for accessibility and compute efficiency. |
| Source Code: https://github.com/alwalt/BioFM |
|
|
| --- |
|
|
| ## 📊 Data Source |
|
|
| All training samples originate from the **ARCHS4 Human RNA-seq v2.5** public repository: |
|
|
| **ARCHS4 Reference:** |
| Lachmann A., Torre D., Keenan A.B., Jagodnik K.M., et al. |
| **“Massive mining of publicly available RNA-seq data from human and mouse.”** |
| *Nature Communications* 9, 1366 (2018). |
| Dataset: https://maayanlab.cloud/archs4/ |
|
|
| ### **Filtering Procedure** |
| - Loaded all human bulk RNA-seq metadata from ARCHS4 v2.5 HDF5 |
| - Selected samples whose metadata matched any of these keywords: |
| `cancer | tumor | carcinoma | leukemia | lymphoma | melanoma | glioma` |
| - Removed samples lacking clear disease annotations |
| - Used ARCHS4 log-TPM matrices (gene × sample) |
| - Final dataset: ~76k cancer samples, 19,357 genes |
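The keyword selection step above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the metadata layout (a dict from sample ID to a free-text description), the function names, and the case-insensitive substring matching are all assumptions.

```python
import re

# Keywords listed in the filtering procedure.
CANCER_KEYWORDS = ["cancer", "tumor", "carcinoma", "leukemia",
                   "lymphoma", "melanoma", "glioma"]
CANCER_PATTERN = re.compile("|".join(CANCER_KEYWORDS), flags=re.IGNORECASE)


def is_cancer_sample(description):
    """Return True if a free-text description mentions any keyword."""
    # Assumption: an empty or missing description counts as
    # "lacking clear disease annotations" and is dropped.
    if not description or not description.strip():
        return False
    return CANCER_PATTERN.search(description) is not None


def select_cancer_samples(metadata):
    """metadata: dict of sample ID -> description (hypothetical layout)."""
    return [sid for sid, text in metadata.items() if is_cancer_sample(text)]


# Toy metadata, invented for illustration only.
toy_metadata = {
    "S1": "Primary breast carcinoma, untreated",
    "S2": "Healthy liver tissue",
    "S3": "",
    "S4": "Acute myeloid LEUKEMIA bone marrow",
}
selected = select_cancer_samples(toy_metadata)
print(selected)
```

Plain substring matching also accepts descriptions such as "non-tumor control", so a real pipeline would likely need a second pass over the matched annotations.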
|
|
| No private, clinical, controlled-access, or proprietary data were used. |
|
|
| --- |
|
|
| ## 🧠 Model Architecture (Summary) |
|
|
| CancerTranscriptome-Mini-48M includes: |
|
|
| ### **1. Gene Identity Embeddings** |
| - Precomputed **ESM2 embeddings** for each protein-coding gene |
| - Projected into model dimension (320) |
|
|
| ### **2. Rotary Expression Embeddings (REE)** |
| - Deterministic sinusoidal continuous-value embedding |
| - Embeddings at masked positions are set to zero (masked inputs carry the sentinel value -10) |
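A deterministic sinusoidal embedding of a continuous expression value can be sketched as below. The frequency base of 10000, the sin-then-cos layout, and the function name are assumptions made for illustration; only the model dimension (320) and the mask sentinel (-10) come from this card.

```python
import math

MASK_VALUE = -10.0  # sentinel for masked positions, as stated in this card


def expression_embedding(value, dim=320, base=10000.0):
    """Map one scalar expression value to a dim-sized sinusoidal vector."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    # Masked positions get an all-zero embedding.
    if value == MASK_VALUE:
        return [0.0] * dim
    # One frequency per sin/cos pair; `base` is an assumed hyperparameter.
    inv_freq = [1.0 / (base ** (2 * i / dim)) for i in range(dim // 2)]
    angles = [value * f for f in inv_freq]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


emb = expression_embedding(2.5)
masked = expression_embedding(MASK_VALUE)
print(len(emb), sum(masked))
```

Because the mapping has no learned parameters, the same expression value always yields the same vector.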
|
|
| ### **3. Graph Neural Network Layer** |
| - **GCNConv** (Kipf & Welling) applied on a curated gene-gene graph |
| - Injects biological prior knowledge |
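The propagation rule behind a GCN layer (Kipf & Welling) averages each gene's features with those of its graph neighbours, using symmetric degree normalization. The sketch below shows that step in plain Python on a toy graph; it omits the learned weight matrix and bias that `GCNConv` applies, and assumes an undirected graph with self-loops added. The function name and toy data are illustrative only.

```python
import math


def gcn_propagate(features, edges):
    """One step of D^-1/2 (A + I) D^-1/2 X, without learned weights."""
    n = len(features)
    # Start every node with a self-loop, then add undirected edges.
    neighbors = [{i} for i in range(n)]
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = [len(nb) for nb in neighbors]

    out = []
    for i in range(n):
        row = [0.0] * len(features[0])
        for j in neighbors[i]:
            # Symmetric normalization: 1 / sqrt(deg_i * deg_j)
            w = 1.0 / math.sqrt(degree[i] * degree[j])
            for k, x in enumerate(features[j]):
                row[k] += w * x
        out.append(row)
    return out


# Toy graph: genes 0 and 1 are connected, gene 2 is isolated.
toy_features = [[1.0], [3.0], [5.0]]
propagated = gcn_propagate(toy_features, edges=[(0, 1)])
print(propagated)
```

Connected genes end up with blended features, while the isolated gene keeps its own value. This is the sense in which the graph injects prior knowledge.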
|
|
| ### **4. Expression Binning** |
| - Genes are sorted by learnable importance scores |
| - The sorted genes are divided into 10 bins |
| - Each bin receives its own **local Performer** attention |
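The sort-then-split step can be sketched as follows. In the model the importance scores are learned and each bin is then passed to its own local Performer layer; here the scores are made-up numbers and only the bin assignment is shown. The near-equal bin sizes and the function name are assumptions, since 19,357 genes do not divide evenly into 10 bins.

```python
def assign_bins(scores, num_bins=10):
    """Sort gene indices by descending score and split them into bins."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Assumption: bins are as equal in size as possible; the first
    # `extra` bins take one additional gene each.
    base, extra = divmod(len(order), num_bins)
    bins, start = [], 0
    for b in range(num_bins):
        size = base + (1 if b < extra else 0)
        bins.append(order[start:start + size])
        start += size
    return bins


# Placeholder scores for 19,357 genes (not real importance scores).
toy_scores = [(i * 37) % 101 for i in range(19357)]
bins = assign_bins(toy_scores)
print([len(b) for b in bins])
```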
|
|
| ### **5. Global Performer Attention** |
| - 2 stacked Performer layers across all genes |
|
|
| ### **6. Prediction Head** |
| - MLP → scalar value per gene |
| - Used for masked-expression reconstruction |
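Masked-expression reconstruction can be illustrated with a small sketch: some input values are replaced by the mask sentinel, and the error is measured only at those positions. The mask ratio, the mean-squared-error objective, and the function names are assumptions for illustration; this card does not state the training loss.

```python
import random

MASK_VALUE = -10.0  # sentinel for masked positions, as stated in this card


def mask_expression(values, mask_ratio=0.15, seed=0):
    """Replace a random subset of values with the mask sentinel."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(values) * mask_ratio))
    masked_idx = sorted(rng.sample(range(len(values)), n_mask))
    masked_set = set(masked_idx)
    corrupted = [MASK_VALUE if i in masked_set else v
                 for i, v in enumerate(values)]
    return corrupted, masked_idx


def masked_mse(predicted, target, masked_idx):
    """Mean squared error restricted to the masked positions."""
    return sum((predicted[i] - target[i]) ** 2 for i in masked_idx) / len(masked_idx)


# Toy log-TPM values, invented for illustration.
toy_values = [0.5, 1.2, 3.3, 0.0, 2.1, 4.4, 0.9, 1.1, 2.2, 3.0]
corrupted, masked_idx = mask_expression(toy_values, mask_ratio=0.2)
# A perfect prediction gives zero loss at the masked positions.
print(masked_mse(toy_values, toy_values, masked_idx))
```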
|
|
| Total parameters: **48,336,162 (~48M)** |
|
|
| --- |
|
|
| ## 🎯 Intended Use |
|
|
| This model produces **context-aware gene embeddings** for downstream cancer transcriptomic tasks: |
|
|
| - Tumor subtype prediction |
| - Drug response modeling |
| - Immune infiltration scoring |
| - Survival / risk modeling |
| - Gene expression imputation |
| - Dimensionality reduction |
| - Transfer learning to TCGA, CCLE, DepMap, GEO tumor datasets |
|
|
| --- |
|
|
| ## 🚀 How to Use |
|
|
| Download the weights and run inference: |
|
|
| ```python |
| import torch |
| import safetensors.torch as st |
| |
| from model import BulkFormer  # from this repo |
| |
| # Build the model |
| model = BulkFormer( |
|     dim=320, |
|     graph=torch.load("edge_index.pt"),  # provide your graph |
|     gene_emb=torch.load("esm2_gene_emb.pt"), |
|     gene_length=19357, |
|     bin_head=8, |
|     full_head=4, |
|     bins=10, |
|     gb_repeat=1, |
|     p_repeat=2, |
| ) |
| |
| # Load the weights and switch to inference mode |
| state = st.load_file("model.safetensors") |
| model.load_state_dict(state) |
| model.eval() |
| |
| # Example input: 19,357-gene log-TPM vector |
| # (random values stand in for real expression data) |
| x = torch.randn(1, 19357) |
| |
| # Disable gradient tracking for inference |
| with torch.no_grad(): |
|     out = model(x) |
| |
| print(out.shape)  # [1, 19357] |
| ``` |