# internvl3-2b-walk-lora-Epoch3-8500-v2
Fine-tuned LoRA adapter for InternVL3-2B trained on the WalkVLM dataset for visually impaired navigation assistance.
## Model Details
| Attribute | Value |
|---|---|
| Base Model | OpenGVLab/InternVL3-2B |
| Method | LoRA (Low-Rank Adaptation) |
| LoRA Rank | 128 |
| Target Modules | down_proj, gate_proj, k_proj, o_proj, q_proj, up_proj, v_proj |
| Task | Navigation hazard detection |
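For reference, the hyperparameters in the table map onto a standard `peft` `LoraConfig` roughly as sketched below. This is illustrative only: `lora_alpha` and `lora_dropout` are not stated in this card, so the values shown are assumptions (`lora_alpha=128` is inferred from the `scaling = 1.0` used in the merge code under Usage).

```python
# Illustrative LoraConfig mirroring the table above.
# lora_alpha and lora_dropout are assumptions, not documented values.
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,
    lora_alpha=128,  # assumed from scaling = lora_alpha / lora_r = 1.0
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,  # assumed; not stated in this card
    task_type="CAUSAL_LM",
)
```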
## Usage
```python
import torch
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Load the base model
model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL3-2B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("OpenGVLab/InternVL3-2B", trust_remote_code=True)

# Download the adapter weights
adapter_path = hf_hub_download("blind-assist/internvl3-2b-walk-lora-Epoch3-8500-v2", "adapter_model.safetensors")
adapter_weights = load_file(adapter_path)

# Merge the LoRA weights into the base model
model_state = model.state_dict()
scaling = 1.0  # lora_alpha / lora_r = 128 / 128
for key in adapter_weights:
    if '.lora_A.' in key:
        lora_b_key = key.replace('.lora_A.', '.lora_B.')
        if lora_b_key in adapter_weights:
            # Map the adapter key back to the base model's parameter name
            model_key = key.replace('.lora_A.', '.').replace('base_model.model.', '')
            if model_key in model_state:
                lora_a = adapter_weights[key].float().to(model_state[model_key].device)
                lora_b = adapter_weights[lora_b_key].float().to(model_state[model_key].device)
                # delta_W = B @ A, scaled by lora_alpha / lora_r
                delta = torch.matmul(lora_b, lora_a) * scaling
                model_state[model_key] = (model_state[model_key].float() + delta).to(torch.bfloat16)
    elif '.lora_B.' not in key:
        # Load any other fine-tuned (non-LoRA) weights stored in the adapter file
        model_key = key.replace('base_model.model.', '')
        if model_key in model_state and model_state[model_key].shape == adapter_weights[key].shape:
            model_state[model_key] = adapter_weights[key].to(model_state[model_key].device)

model.load_state_dict(model_state)
model.eval()

# Inference
prompt = "Given the visual input from the user's forward perspective, generate exactly one short sentence to guide a visually impaired user by identifying critical obstacles or landmarks, describing their locations using clock directions relative to the user (12 o'clock is straight ahead), including relevant details such as size, material, or distance, and giving one clear action, while prioritizing immediate safety and avoiding any extra explanation."
response = model.chat(
    tokenizer=tokenizer,
    pixel_values=your_image_tensor,  # Preprocessed image (see the sketch below)
    question=prompt,
    generation_config=dict(max_new_tokens=256, do_sample=False)
)
print(response)
```
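The `your_image_tensor` placeholder above must be preprocessed the way InternVL expects (448×448 input, ImageNet normalization). Below is a minimal single-tile sketch; the official InternVL3 examples additionally apply dynamic multi-tile preprocessing, so treat this as a simplified approximation, and the file path is just an example.

```python
# Minimal single-tile preprocessing sketch for InternVL-style inputs.
# The official InternVL3 examples also use dynamic tiling; this is simplified.
import torch
from PIL import Image
import torchvision.transforms as T

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

transform = T.Compose([
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])

image = Image.open("street_scene.jpg").convert("RGB")  # example path
your_image_tensor = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()
```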
## Training
- Dataset: blind-assist/walk-train
- Epochs: 3
- Learning Rate: 4e-5
- Batch Size: 1 (with gradient accumulation)
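These hyperparameters correspond to a standard Hugging Face `TrainingArguments` setup roughly as sketched below. The gradient accumulation step count is not stated in this card, so the value shown is a placeholder assumption.

```python
# Rough mapping of the listed hyperparameters; gradient_accumulation_steps
# is a placeholder, since the actual value is not stated in this card.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./internvl3-2b-walk-lora",  # hypothetical output path
    num_train_epochs=3,
    learning_rate=4e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # placeholder / assumption
    bf16=True,  # assumption, matching the bfloat16 base model
)
```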
## Citation
```bibtex
@misc{blindassist2024walkvlm,
  title={WalkVLM: Fine-tuned Vision-Language Model for Blind Navigation},
  author={Blind-Assist Team},
  year={2024},
  url={https://huggingface.co/blind-assist/internvl3-2b-walk-lora-Epoch3-8500-v2}
}
```