ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks
Paper • 2605.25388 • Published
ViroCaduceus is a Caduceus-based (Mamba-DNA) nucleotide language model pre-trained on the ViroBlend (ViroBland) corpus, a small (216 Mbp) mixed pretraining dataset with source-wise stratified sampling to balance human reference, multi-species genomes, and viral in-domain sequences.
It is released as part of the ViroBench benchmark for evaluating viral nucleotide foundation models.
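The card does not spell out how the source-wise stratified sampling is done, so the following is only a minimal sketch of the general idea: each source gets a fixed share of a base-pair budget, and sequences are drawn at random from that source until its share is filled. The function name `stratified_sample`, the source labels, and the fractions are all hypothetical and are not taken from the ViroBlend recipe.

```python
import random


def stratified_sample(sources, fractions, budget_bp, seed=0):
    """Pick sequences per source so each source fills at most its share of the budget.

    sources   : dict mapping a source name to a list of sequences (strings)
    fractions : dict mapping a source name to its share of the budget (assumed values)
    budget_bp : total number of base pairs to sample across all sources
    """
    rng = random.Random(seed)
    picked = {}
    for name, seqs in sources.items():
        target = int(budget_bp * fractions[name])  # per-source quota in bp
        pool = list(seqs)
        rng.shuffle(pool)
        chosen, total = [], 0
        for seq in pool:
            if total + len(seq) > target:
                continue  # skip sequences that would overshoot the quota
            chosen.append(seq)
            total += len(seq)
        picked[name] = chosen
    return picked


# Toy data; real inputs would be genome records, not 10-bp strings.
toy_sources = {
    "human": ["A" * 10] * 20,
    "multi_species": ["C" * 10] * 20,
    "viral": ["G" * 10] * 20,
}
toy_fractions = {"human": 0.25, "multi_species": 0.25, "viral": 0.5}  # hypothetical mix
sample = stratified_sample(toy_sources, toy_fractions, budget_bp=200)
```

Fixing the quota per source, rather than sampling from the pooled corpus, keeps a large source such as the human reference from crowding out the smaller viral set.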
| Item | Value |
|---|---|
| Architecture | Caduceus-Ph (d_model=256, Mamba backbone) |
| Pretraining data | ViroBlend (~216 Mbp) |
Install dependencies:
```bash
pip install torch transformers mamba-ssm causal-conv1d
```
Extract an embedding for a random DNA sequence:
```bash
python get_embedding.py
```
Or load in Python (base model + local pytorch_model.bin):
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

BASE = "kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16"
# REPO = "YDXX/ViroCaduceus"  # after uploading to Hugging Face

tokenizer = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(BASE, trust_remote_code=True)
# load ViroCaduceus weights from pytorch_model.bin if needed (see get_embedding.py)
```
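Before overwriting the base weights, it can help to check how well the parameter names in `pytorch_model.bin` line up with the base model. The helper below is a hedged sketch, not the logic in `get_embedding.py`. The name `diff_state_dict_keys` is hypothetical, and the commented usage assumes the checkpoint is a plain state dict, which this card does not confirm.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) parameter names, each as a sorted list.

    missing    : names the model expects but the checkpoint lacks
    unexpected : names in the checkpoint that the model does not have
    """
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Hypothetical usage with the `model` loaded above (requires torch):
#   state = torch.load("pytorch_model.bin", map_location="cpu")
#   missing, unexpected = diff_state_dict_keys(model.state_dict().keys(), state.keys())
#   model.load_state_dict(state, strict=False)  # strict=False tolerates partial overlap
```

If many names land in both lists, the checkpoint likely uses a different key prefix than the base model, and the keys would need renaming before loading.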
- `config.json`: training export config
- `pytorch_model.bin`: fine-tuned backbone weights
- `get_embedding.py`: minimal embedding demo

```bibtex
@article{ye2026virobench,
  title={ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks},
  author={Ye, Dongxin and Hu, Fang and Hu, Han and Hu, Shu and Tan, Yang and Ouyang, Wanli and Li, Stan Z and Cui, Jie and Dong, Nanqing},
  journal={arXiv preprint arXiv:2605.25388},
  year={2026}
}
```