3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance
🎉 Accepted to IROS 2026 — IEEE/RSJ International Conference on Intelligent Robots and Systems
📄 Paper · 🌐 Project Page · 💻 Code
3D HAMSTER is a depth-aware VLM planner that predicts metrically grounded 3D end-effector trajectories directly from a single RGB-D observation and a language instruction. Unlike 2D planners whose pixel waypoints inherit whatever depth lies beneath them, 3D HAMSTER plans in metric 3D space, so the trajectory stays geometrically grounded and can feed straight into a point-cloud low-level policy.
This repository hosts the planner checkpoint — a single self-contained checkpoint (9B, bf16) that bundles the Qwen3-VL LLM, the vision encoder, the geometry merger, and the frozen LingBot-Depth geometry encoder weights.
Usage
This is a custom architecture (Qwen3VLGeometryForConditionalGeneration) and requires the hamster3d package, which vendors the geometry-encoder code. No separate LingBot-Depth download is needed: the encoder code is in the package and its weights are in this checkpoint.
# 1. Install the inference code
git clone https://github.com/DAVIAN-Robotics/3D_HAMSTER.git
cd 3D_HAMSTER && pip install -e .
# 2. Download this checkpoint into ./ckpt
hf download DAVIAN-Robotics/3D_HAMSTER --local-dir ckpt
from hamster3d.inference import Hamster3DPredictor
import numpy as np
from PIL import Image
predictor = Hamster3DPredictor("ckpt/") # device="cuda:0", bf16 by default
rgb = Image.open("examples/sample_0_rgb.png")
depth = np.load("examples/sample_0_depth.npy") # float32, meters, shape (H, W)
instruction = open("examples/sample_0_instruction.txt").read().strip()
result = predictor.predict(rgb, depth, instruction) # 3D trajectory prediction
print(result["waypoints"]) # [[u, v, depth], ...] pixel u,v (0-1000) + metric depth (m)
print(result["actions"]) # ["Close Gripper", None, ..., "Open Gripper"]
Inputs: an RGB image (any resolution; auto-resized to 640 px longest edge) + a metric depth map (float32, meters, aligned to the RGB frame) + a language instruction.
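Many depth cameras store depth as 16-bit integers rather than float32 meters, so a conversion step is often needed before calling the predictor. The sketch below is one way to do it. The helper name `to_metric_depth` and the default `depth_scale=0.001` (raw values in millimeters) are assumptions for illustration, not part of the hamster3d package; check your sensor's documentation for the actual unit.

```python
import numpy as np


def to_metric_depth(raw_depth, depth_scale=0.001):
    """Convert a raw sensor depth image to float32 meters.

    depth_scale is an assumption (0.001 means raw values are millimeters).
    """
    raw = np.asarray(raw_depth)
    if raw.ndim != 2:
        raise ValueError(f"expected an (H, W) depth map, got shape {raw.shape}")
    depth_m = raw.astype(np.float32) * np.float32(depth_scale)
    # Replace non-finite readings with 0.0. Using zero as the "no reading"
    # value is an assumption; the model card does not specify one.
    depth_m[~np.isfinite(depth_m)] = 0.0
    return depth_m


# Synthetic 2x3 uint16 depth image in millimeters, standing in for a sensor frame.
raw = np.array([[500, 1000, 0], [1500, 2000, 65535]], dtype=np.uint16)
depth = to_metric_depth(raw)
print(depth.dtype, depth.shape)
```

The conversion does not register depth to the RGB frame. If your camera delivers unaligned streams, align them with the camera SDK first.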
Output: a metric 3D end-effector trajectory — [u, v, depth] waypoints with per-waypoint gripper actions.
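To hand the trajectory to a point-cloud policy, the [u, v, depth] waypoints usually need to become camera-frame XYZ points. The sketch below shows a standard pinhole back-projection under several assumptions: u and v are normalized to 0-1000 over the image width and height (as the usage comment suggests), depth is the z-distance along the optical axis, and the intrinsics `fx, fy, cx, cy` are known. The function name and all numeric values are hypothetical.

```python
import numpy as np


def waypoints_to_camera_xyz(waypoints, width, height, fx, fy, cx, cy):
    """Back-project [u, v, depth] waypoints to camera-frame XYZ in meters.

    Assumes u, v are normalized to 0-1000 over image width/height and that
    depth is the z-distance along the optical axis (both are assumptions).
    """
    wp = np.asarray(waypoints, dtype=np.float64).reshape(-1, 3)
    px = wp[:, 0] / 1000.0 * width   # normalized u -> pixel column
    py = wp[:, 1] / 1000.0 * height  # normalized v -> pixel row
    z = wp[:, 2]
    x = (px - cx) * z / fx           # pinhole model: X = (u - cx) * Z / fx
    y = (py - cy) * z / fy           # pinhole model: Y = (v - cy) * Z / fy
    return np.stack([x, y, z], axis=1)


# Hypothetical values standing in for result["waypoints"] and result["actions"].
waypoints = [[500, 500, 0.60], [750, 500, 0.80]]
actions = ["Close Gripper", "Open Gripper"]
xyz = waypoints_to_camera_xyz(waypoints, width=640, height=480,
                              fx=600.0, fy=600.0, cx=320.0, cy=240.0)
for point, action in zip(xyz, actions):
    print(np.round(point, 3), action)
```

Use the intrinsics of the frame the depth map is aligned to. If the image was resized, scale the intrinsics accordingly.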
The Gradio demo in the code repo additionally supports 2D Trajectory, 2D/3D Pointing, 2D Bounding Box, and General VQA task styles.
Download tip: if hf download stalls, disable the xet backend: HF_HUB_DISABLE_XET=1 hf download DAVIAN-Robotics/3D_HAMSTER --local-dir ckpt
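The same download can also be scripted from Python with `huggingface_hub.snapshot_download`. This is a minimal sketch: the wrapper name `download_checkpoint` is hypothetical, and it sets the environment variable before importing the library on the assumption that the setting is read at import time.

```python
import os


def download_checkpoint(repo_id="DAVIAN-Robotics/3D_HAMSTER", local_dir="ckpt",
                        disable_xet=True):
    """Download the checkpoint into local_dir, optionally without xet."""
    if disable_xet:
        # Set before huggingface_hub is imported; this has no effect if the
        # library was already imported elsewhere in the process (assumption).
        os.environ["HF_HUB_DISABLE_XET"] = "1"
    # Imported lazily so the environment variable above is already in place.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Usage (requires network access and the huggingface_hub package):
# path = download_checkpoint()
# print(path)
```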
Model Details
| Component | Details |
|---|---|
| Base VLM | Qwen3-VL-8B (Stage-1 pretrained) |
| Geometry encoder | LingBot-Depth — DINOv2 ViT-L/14, frozen (~306M params) |
| Fusion | resize_and_add (element-wise add after spatial alignment) |
| Training | LoRA (rank 64, α 128) on the LLM + fully trained merger/decoder, with a dense depth-reconstruction loss |
| Precision | bfloat16 |
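
The fusion row describes resize_and_add as an element-wise add after spatial alignment. The NumPy sketch below illustrates that idea on two feature grids. It is not the repository's implementation: the function body, the nearest-neighbor resizing, and the array shapes are assumptions chosen to keep the example short.

```python
import numpy as np


def resize_and_add(visual, geometry):
    """Resize geometry features to the visual grid, then add element-wise.

    visual:   (Hv, Wv, C) array of vision-encoder features
    geometry: (Hg, Wg, C) array of geometry-encoder features
    Nearest-neighbor resizing is an assumption made for brevity.
    """
    hv, wv, c = visual.shape
    hg, wg, cg = geometry.shape
    if c != cg:
        raise ValueError("channel dimensions must match before adding")
    # For each target row/column, pick the source row/column that covers it.
    rows = np.arange(hv) * hg // hv
    cols = np.arange(wv) * wg // wv
    resized = geometry[rows[:, None], cols[None, :], :]
    return visual + resized


# Toy shapes: a 4x4 visual grid and a 2x2 geometry grid with 2 channels each.
visual = np.zeros((4, 4, 2), dtype=np.float32)
geometry = np.arange(2 * 2 * 2, dtype=np.float32).reshape(2, 2, 2)
fused = resize_and_add(visual, geometry)
print(fused.shape)
```

Adding the two grids, rather than concatenating them, keeps the number of visual tokens and the channel width unchanged.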
See the project page for benchmarks and qualitative results.
Acknowledgments & Licensing
Released under the Apache License 2.0. 3D HAMSTER builds on:
- Qwen3-VL: base vision-language model.
- LingBot-Depth (Robbyant, Apache-2.0): the geometry encoder; its frozen weights are bundled in this checkpoint and its code is vendored in the hamster3d package.
- DINOv2 (Meta AI, Apache-2.0): backbone of the LingBot-Depth encoder.
All bundled components are Apache-2.0; their attributions are retained.
Citation
@INPROCEEDINGS{hwang20263dhamster,
author={Hwang, Dongyoon and Lee, Byungkun and Kim, Dongjin and Jang, Hyojin and Jin, Hoiyeong and Mun, Jueun and Park, Minho and Lee, Hojoon and Kim, Hyunseung and Choo, Jaegul},
booktitle={2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
title={{3D HAMSTER}: Bridging Planning and Control in Hierarchical Vision Language Action Models through {3D} Trajectory Guidance},
year={2026}}
Model tree for DAVIAN-Robotics/3D_HAMSTER: base model Qwen/Qwen3-VL-8B-Instruct