Visual Instruction Pretraining for Domain-Specific Foundation Models

Official model weights and documentation for ViTP (Visual insTruction Pretraining), a novel paradigm for pretraining foundation models in downstream domains like Remote Sensing and Medical Imaging.

Introduction

Modern computer vision is converging on a closed loop in which perception, reasoning, and generation mutually reinforce each other. However, the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features remains largely underexplored.

ViTP addresses this gap by directly leveraging reasoning to enhance perception. It embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens.
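To make the idea concrete, below is a minimal, hypothetical PyTorch sketch of the core VRL mechanism: only a sparse random subset of visual tokens from the ViT backbone is passed on to the language model, so the backbone must pack robust, domain-relevant information into few tokens. All module names, the keep ratio, and the language-model interface are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class SparseVisualTokens(nn.Module):
    """Sketch of Visual Robustness Learning (VRL): randomly keep a sparse
    subset of visual tokens so the ViT must encode robust, domain-relevant
    features. The keep ratio is an assumption, not the official setting."""

    def __init__(self, keep_ratio: float = 0.25):
        super().__init__()
        self.keep_ratio = keep_ratio

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (batch, num_tokens, dim), the ViT backbone output
        b, n, d = vis_tokens.shape
        n_keep = max(1, int(n * self.keep_ratio))
        # Independently sample which tokens to keep for each image
        scores = torch.rand(b, n, device=vis_tokens.device)
        keep_idx = scores.topk(n_keep, dim=1).indices            # (b, n_keep)
        keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, d)      # (b, n_keep, d)
        return vis_tokens.gather(1, keep_idx)


class ViTPSketch(nn.Module):
    """End-to-end wiring of the ViTP idea: ViT backbone -> sparse token
    selection -> projection into the LLM embedding space -> language model,
    trained with the usual next-token loss on visual instruction data.
    Component names and the LLM call signature are hypothetical."""

    def __init__(self, vit_backbone: nn.Module, projector: nn.Module,
                 language_model: nn.Module, keep_ratio: float = 0.25):
        super().__init__()
        self.vit = vit_backbone
        self.sparsify = SparseVisualTokens(keep_ratio)
        self.projector = projector
        self.llm = language_model

    def forward(self, images: torch.Tensor, text_tokens: torch.Tensor):
        vis = self.vit(images)        # (b, n, d_vit)
        vis = self.sparsify(vis)      # (b, n_keep, d_vit)
        vis = self.projector(vis)     # map into the LLM embedding space
        return self.llm(visual_embeds=vis, input_ids=text_tokens)
```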


Figure (Framework): A conceptual illustration of the ViTP framework. A ViT backbone is embedded within a large VLM and pretrained with domain-specific instruction following and Visual Robustness Learning (VRL).

Figure (Synergy): ViTP forges a novel link from high-level reasoning to low-level perception, establishing new state-of-the-art performance across 16 challenging benchmarks.


Pretrained Backbones

The following ViT-Large (300M) backbones are available in the repository:

Model          | Pretrain Domain | Weights
ViTP_ViT_L_rs  | Remote Sensing  | Download
ViTP_ViT_L_med | Medical Imaging | Download

These weights are intended as initializations for a range of downstream tasks, including:

  • Object Detection (via MMRotate)
  • Semantic Segmentation (via MMSegmentation)
  • Change Detection (via OpenCD)

For detailed installation and usage instructions, please refer to the official GitHub repository.
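As a concrete illustration, the snippet below shows one way to point an OpenMMLab-style backbone config (e.g., MMSegmentation or MMRotate) at a downloaded checkpoint. The backbone type, dimensions, and checkpoint path are assumptions for illustration only; the official GitHub repository provides the exact configs.

```python
# Minimal sketch of reusing a ViTP checkpoint as a backbone initialization
# in an OpenMMLab-style config. The checkpoint path is a placeholder;
# replace it with the file you downloaded from this repository.
model = dict(
    backbone=dict(
        type='VisionTransformer',
        embed_dims=1024, num_layers=24, num_heads=16,  # ViT-Large (300M)
        init_cfg=dict(type='Pretrained',
                      checkpoint='path/to/ViTP_ViT_L_rs.pth'),
    ),
    # ... neck, decode head, and training settings follow the usual
    # MMSegmentation / MMRotate / OpenCD configs for your task.
)
```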

Citation

If you use this work in your research, please cite:

@article{Li_2025_ViTP,
  title={Visual Instruction Pretraining for Domain-Specific Foundation Models},
  author={Li, Yuxuan and Zhang, Yicheng and Tang, Wenhao and Dai, Yimian and Cheng, Ming-Ming and Li, Xiang and Yang, Jian},
  journal={arXiv},
  year={2025}
}

License

This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license and is intended for non-commercial use only. For commercial use, please obtain formal permission from the authors.
