# InternVL2.5-1B – Stage 2 (Impression)

## 1. Model Overview
This model is part of a Vision-Language AI system designed for chest X-ray analysis in Vietnamese clinical settings.
The full pipeline consists of 3 stages:
- Stage 1: Findings generation (image → radiology findings)
- Stage 2: Impression generation (findings → clinical impression)
- Stage 3: Multi-turn conversation (findings + impression + dialogue)
This repository corresponds to:
- Stage: 2 (Impression)
- Task: Generate clinical impression (final conclusion) from chest X-ray images
- Domain: Vietnamese medical imaging (Chest X-ray)
The model generates the final diagnostic impression directly from the input image.
The model is fine-tuned from a vision-language backbone and was evaluated against several alternative architectures (InternVL, Vintern, Qwen2-VL, MiniCPM-V, LaVy).
Among these, Qwen2-VL-7B achieved the best performance; this 1B checkpoint is released for benchmarking and comparison.
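For orientation, the sketch below shows how the stage outputs could be chained. This is a hypothetical composition, assuming each stage exposes the same `model.chat` interface; `stage1_model` and both prompts are placeholders, and only the Stage 2 checkpoint is provided by this repository.

```python
# Hypothetical chaining of Stage 1 and Stage 2 (a sketch, not the released
# pipeline). stage1_model and the prompts below are placeholders.
def run_pipeline(stage1_model, stage2_model, tokenizer, pixel_values, generation_config):
    # Stage 1: image -> radiology findings.
    findings = stage1_model.chat(
        tokenizer, pixel_values,
        "<image>\nDescribe the findings on this chest X-ray.",
        generation_config)
    # Stage 2: findings -> clinical impression (the model in this card).
    impression = stage2_model.chat(
        tokenizer, pixel_values,
        f"<image>\nFindings: {findings}\nWhat is the clinical impression?",
        generation_config)
    return findings, impression
```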
## 2. Installation

```bash
pip install torch torchvision transformers pillow
```

`decord` is only required for video inputs and is not used in the example below. Installing `flash-attn` is optional but enables `use_flash_attn=True` at load time.
## 3. Inference

A GPU with bfloat16 support is recommended.
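If you are unsure whether your hardware supports bfloat16, here is a minimal check (the float16 fallback is an assumption, not part of the original example):

```python
import torch

# Use bfloat16 on GPUs that support it; otherwise fall back to float16.
dtype = (torch.bfloat16
         if torch.cuda.is_available() and torch.cuda.is_bf16_supported()
         else torch.float16)
```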
```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    """Resize to a square tile and normalize with ImageNet statistics."""
    return T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    """Pick the tiling grid whose aspect ratio best matches the input image."""
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            # On a tie, prefer the grid with more tiles if the image is large enough.
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    """Split the image into up to `max_num` square tiles of side `image_size`."""
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # Enumerate all (columns, rows) grids with between min_num and max_num tiles.
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1)
        for i in range(1, n + 1)
        for j in range(1, n + 1)
        if min_num <= i * j <= max_num
    )
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # Resize to the full grid size, then crop out each tile.
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size,
        )
        processed_images.append(resized_img.crop(box))

    # Optionally append a global thumbnail alongside the tiles.
    if use_thumbnail and len(processed_images) != 1:
        processed_images.append(image.resize((image_size, image_size)))
    return processed_images


def load_image(image_file, input_size=448, max_num=12):
    """Load an image file and return a stacked tensor of preprocessed tiles."""
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    return torch.stack([transform(tile) for tile in images])


path = "THP2903/InternVL2_5-1B_impression_v2"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,  # set to False if flash-attn is not installed
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

pixel_values = load_image("your_image.jpg", max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=512, do_sample=True)

# "PA chest X-ray of a 48-year-old male patient. What is the conclusion?"
question = "<image>\nẢnh chụp X-quang bệnh nhân nam, 48 tuổi, tư thế PA, kết luận bị gì?"
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```
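The `chat` helper in the InternVL remote code also accepts a conversation history, which allows follow-up questions about the same image. A minimal sketch, assuming the `history`/`return_history` keywords behave as in the upstream InternVL2.5 examples (the follow-up prompt is illustrative):

```python
# First turn: ask for the impression and keep the conversation history.
question = "<image>\nẢnh chụp X-quang bệnh nhân nam, 48 tuổi, tư thế PA, kết luận bị gì?"
response, history = model.chat(tokenizer, pixel_values, question,
                               generation_config, history=None, return_history=True)

# Follow-up turn ("Is any further imaging needed?") reusing the history.
follow_up = "Có cần chụp thêm phim nào không?"
response, history = model.chat(tokenizer, pixel_values, follow_up,
                               generation_config, history=history, return_history=True)
print(response)
```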
## 4. Notes
- Input must be a chest X-ray image
- Output is the final clinical impression (diagnostic conclusion)
- This model uses the original InternVL inference pipeline without modification
- For best performance, consider using Qwen2-VL-7B