# internvl3-2b-walk-lora-Epoch3-8500-v2
Fine-tuned LoRA adapter for InternVL3-2B trained on the WalkVLM dataset for visually impaired navigation assistance.
## Model Details
| Attribute | Value |
|---|---|
| Base Model | OpenGVLab/InternVL3-2B |
| Method | LoRA (Low-Rank Adaptation) |
| LoRA Rank | 128 |
| Target Modules | down_proj, gate_proj, k_proj, o_proj, q_proj, up_proj, v_proj |
| Task | Navigation hazard detection |
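For reference, the hyperparameters in the table map onto a standard `peft` `LoraConfig` roughly as sketched below. This is illustrative only: `lora_alpha` and `lora_dropout` are not stated in this card, so the values shown are assumptions (`lora_alpha=128` is inferred from the `scaling = 1.0` used in the merge code under Usage).

```python
# Illustrative LoraConfig mirroring the table above.
# lora_alpha and lora_dropout are assumptions, not documented values.
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,
    lora_alpha=128,  # assumed from scaling = lora_alpha / lora_r = 1.0
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,  # assumed; not stated in this card
    task_type="CAUSAL_LM",
)
```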
## Usage
```python
import torch
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Load the base model
model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL3-2B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("OpenGVLab/InternVL3-2B", trust_remote_code=True)

# Download the adapter weights
adapter_path = hf_hub_download("blind-assist/internvl3-2b-walk-lora-Epoch3-8500-v2", "adapter_model.safetensors")
adapter_weights = load_file(adapter_path)

# Merge the LoRA weights into the base model
model_state = model.state_dict()
scaling = 1.0  # lora_alpha / lora_r = 128 / 128
for key in adapter_weights:
    if '.lora_A.' in key:
        lora_b_key = key.replace('.lora_A.', '.lora_B.')
        if lora_b_key in adapter_weights:
            # Map the adapter key back to the base model's parameter name
            model_key = key.replace('.lora_A.', '.').replace('base_model.model.', '')
            if model_key in model_state:
                lora_a = adapter_weights[key].float().to(model_state[model_key].device)
                lora_b = adapter_weights[lora_b_key].float().to(model_state[model_key].device)
                # delta_W = B @ A, scaled by lora_alpha / lora_r
                delta = torch.matmul(lora_b, lora_a) * scaling
                model_state[model_key] = (model_state[model_key].float() + delta).to(torch.bfloat16)
    elif '.lora_B.' not in key:
        # Load any other fine-tuned (non-LoRA) weights stored in the adapter file
        model_key = key.replace('base_model.model.', '')
        if model_key in model_state and model_state[model_key].shape == adapter_weights[key].shape:
            model_state[model_key] = adapter_weights[key].to(model_state[model_key].device)

model.load_state_dict(model_state)
model.eval()

# Inference
prompt = "Given the visual input from the user's forward perspective, generate exactly one short sentence to guide a visually impaired user by identifying critical obstacles or landmarks, describing their locations using clock directions relative to the user (12 o'clock is straight ahead), including relevant details such as size, material, or distance, and giving one clear action, while prioritizing immediate safety and avoiding any extra explanation."
response = model.chat(
    tokenizer=tokenizer,
    pixel_values=your_image_tensor,  # Preprocessed image (see the sketch below)
    question=prompt,
    generation_config=dict(max_new_tokens=256, do_sample=False)
)
print(response)
```
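The `your_image_tensor` placeholder above must be preprocessed the way InternVL expects (448×448 input, ImageNet normalization). Below is a minimal single-tile sketch; the official InternVL3 examples additionally apply dynamic multi-tile preprocessing, so treat this as a simplified approximation, and the file path is just an example.

```python
# Minimal single-tile preprocessing sketch for InternVL-style inputs.
# The official InternVL3 examples also use dynamic tiling; this is simplified.
import torch
from PIL import Image
import torchvision.transforms as T

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

transform = T.Compose([
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])

image = Image.open("street_scene.jpg").convert("RGB")  # example path
your_image_tensor = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()
```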
## Training
- Dataset: blind-assist/walk-train
- Epochs: 3
- Learning Rate: 4e-5
- Batch Size: 1 (with gradient accumulation)
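These hyperparameters correspond to a standard Hugging Face `TrainingArguments` setup roughly as sketched below. The gradient accumulation step count is not stated in this card, so the value shown is a placeholder assumption.

```python
# Rough mapping of the listed hyperparameters; gradient_accumulation_steps
# is a placeholder, since the actual value is not stated in this card.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./internvl3-2b-walk-lora",  # hypothetical output path
    num_train_epochs=3,
    learning_rate=4e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # placeholder / assumption
    bf16=True,  # assumption, matching the bfloat16 base model
)
```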
## Citation
```bibtex
@misc{blindassist2024walkvlm,
  title={WalkVLM: Fine-tuned Vision-Language Model for Blind Navigation},
  author={Blind-Assist Team},
  year={2024},
  url={https://huggingface.co/blind-assist/internvl3-2b-walk-lora-Epoch3-8500-v2}
}
```