---
license: cc-by-nc-nd-4.0
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- Pathology
- arxiv:2505.11404
extra_gated_prompt: >-
  The Patho-R1-7B model and its associated materials are released under the
  CC-BY-NC-ND 4.0 license. Access is restricted to non-commercial, academic
  research purposes only, with proper citation required. Any commercial usage,
  redistribution, or derivative work (including training models based on this
  model or generating datasets from its outputs) is strictly prohibited without
  prior written approval. Users must register with an official institutional
  email address (generic domains such as @gmail, @qq, @hotmail, etc. will not
  be accepted). By requesting access, you confirm that your information is
  accurate and current, and that you agree to comply with all terms listed
  herein. If other members of your organization wish to use the model, they
  must register independently and agree to the same terms.
extra_gated_fields:
  Full name (first and last): text
  Institutional affiliation (no abbreviations): text
  Role/Position:
    type: select
    options:
      - Faculty/Principal Investigator
      - PhD Student
      - Postdoctoral Researcher
      - Research Staff
      - Other
  Official institutional email (**must match your Hugging Face primary email; generic domains will be denied**): text
  Intended research use (be specific): text
  I agree to use this model only for non-commercial academic purposes: checkbox
  I agree not to redistribute this model or share it outside of my individual usage: checkbox
  I confirm that all submitted information is accurate and up to date: checkbox
---
# Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner

[[arXiv](https://arxiv.org/abs/2505.11404)] | [[GitHub Repo](https://github.com/Wenchuan-Zhang/Patho-R1)] | [[Cite](#citation❤️)]

## Introduction📝

While vision-language models have made impressive progress in general medical domains, pathology remains a challenging subfield due to its high-resolution image requirements and complex diagnostic reasoning.

To address this gap, we introduce **Patho-R1-7B**, a multimodal pathology reasoner designed to enhance diagnostic understanding through structured reasoning. **Patho-R1-7B** is trained with a three-stage pipeline:

1. *Continued pretraining* on **3.5M pathology figure-caption pairs** for domain knowledge acquisition
2. *Supervised fine-tuning* on **500k expert-annotated Chain-of-Thought samples** to encourage reasoning
3. *Reinforcement learning* with **Group Relative Policy Optimization (GRPO)** to refine response quality

Experimental results show that **Patho-R1-7B** achieves strong performance on key pathology tasks, including **multiple-choice question answering** and **visual question answering**, highlighting its potential for real-world pathology AI applications.

![performance](performance.png)
### Quickstart🏃

The snippet below shows how to chat with the model using `transformers` and `qwen_vl_utils`:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "WenchuanZhang/Patho-R1-7B",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("WenchuanZhang/Patho-R1-7B")

# Example question from the PathMMU test set; ground truth: D
#
# Two reasoning-style system prompts are available (choose one):
# - Chain-of-Draft (CoD), a concise reasoning strategy:
#   "You are a pathology expert, your task is to think step by step, but only
#    keep a minimum draft for each thinking step, with 5 words at most. Return
#    the answer at the end of the response after a separator. Use the following
#    format: <think> Your step-by-step reasoning </think><answer> Your final answer </answer>"
# - Chain-of-Thought (CoT), used below:
messages = [
    {
        "role": "system",
        "content": "You are a pathology expert, your task is to answer question step by step. Use the following format:<think> Your step-by-step reasoning </think><answer> Your final answer </answer>",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./images/example.jpg",
            },
            {
                "type": "text",
                "text": "What feature in the provided micrograph is indicative of chronic inflammation? \n A. Granuloma formation \n B. Multinucleated giant cells \n C. Neutrophilic infiltration \n D. Plasma cells with eccentrically placed nuclei",
            },
        ],
    },
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
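Because the system prompt asks the model to answer in the `<think> … </think><answer> … </answer>` format, you will often want to split the decoded text into its reasoning trace and final answer. A minimal sketch (the helper name is ours, not part of the model's API, and the sample response below is fabricated for illustration):

```python
import re

def parse_reasoning(text):
    """Split a response of the form <think>...</think><answer>...</answer>
    into (reasoning, answer); fall back to (None, raw text) if tags are absent."""
    m = re.search(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", text, re.DOTALL)
    if m is None:
        return None, text.strip()
    return m.group(1).strip(), m.group(2).strip()

# Illustrative model output, not a real generation
reasoning, answer = parse_reasoning(
    "<think> Plasma cells with eccentric nuclei indicate chronic inflammation. </think>"
    "<answer> D </answer>"
)
```

The fallback matters in practice: if generation is truncated before the closing tag, the helper still returns something usable rather than raising.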
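Beyond the `transformers` snippet above, the model can also be served behind an OpenAI-compatible API with vLLM. A minimal sketch, assuming vLLM's default server port (8000); the prompt and image URL are placeholders to replace with your own:

```shell
# Install vLLM and serve the model behind an OpenAI-compatible API
pip install vllm
vllm serve "WenchuanZhang/Patho-R1-7B"

# In another shell: query the server (default port 8000).
# Replace the image_url with a reachable slide image.
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "WenchuanZhang/Patho-R1-7B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What feature in this micrograph indicates chronic inflammation?"},
          {"type": "image_url", "image_url": {"url": "https://example.com/slide.jpg"}}
        ]
      }
    ]
  }'
```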
## Acknowledgements🎖

We gratefully acknowledge the contributions of the open-source community, particularly the following projects, which laid the foundation for various components of this work:

- [Qwen](https://github.com/QwenLM) for powerful vision-language models that underpin our multimodal understanding and generation capabilities.
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO) for document layout detection.
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) for comprehensive optical character recognition.
- [ModelScope Swift](https://github.com/modelscope/ms-swift) for efficient model serving and deployment tools.
- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) for robust LLM training and fine-tuning pipelines.
- [verl](https://github.com/volcengine/verl) for its reinforcement learning training framework.
- [DeepSeek](https://github.com/deepseek-ai) for high-quality models and infrastructure supporting text understanding.

We thank the authors and contributors of these repositories for their dedication and impactful work, which made the development of Patho-R1 possible.
## Citation❤️

If you find our work helpful, a citation would be greatly appreciated:

```bibtex
@article{zhang2025patho,
  title={Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner},
  author={Zhang, Wenchuan and Zhang, Penghao and Guo, Jingru and Cheng, Tao and Chen, Jie and Zhang, Shuwan and Zhang, Zhang and Yi, Yuhao and Bu, Hong},
  journal={arXiv preprint arXiv:2505.11404},
  year={2025}
}
```