3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance

🎉 Accepted to IROS 2026 — IEEE/RSJ International Conference on Intelligent Robots and Systems

3D HAMSTER is a depth-aware VLM planner that predicts metrically grounded 3D end-effector trajectories directly from a single RGB-D observation and a language instruction. Unlike 2D planners whose pixel waypoints inherit whatever depth lies beneath them, 3D HAMSTER plans in metric 3D space, so the trajectory stays geometrically grounded and can feed straight into a point-cloud low-level policy.
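The contrast with 2D planners can be illustrated with a toy example. This is only a sketch on a synthetic depth map: the scene, the helper name `lift_2d_waypoint`, and the numbers are all hypothetical and do not come from the 3D HAMSTER code. It shows that a pixel waypoint lifted by a depth lookup always lands on the visible surface, whereas a waypoint that carries its own depth can sit in free space in front of that surface.

```python
import numpy as np

# Synthetic scene (assumption): a table 1.0 m from the camera,
# with a box whose visible face is 0.6 m from the camera.
depth_map = np.full((480, 640), 1.0, dtype=np.float32)
depth_map[200:280, 300:380] = 0.6


def lift_2d_waypoint(u_px, v_px, depth_map):
    """Lift a 2D pixel waypoint by reading the sensor depth beneath it."""
    return float(depth_map[v_px, u_px])


# A hypothetical pre-grasp waypoint 10 cm in front of the box face,
# i.e. 0.5 m from the camera along the same pixel ray.
u_px, v_px, planned_depth = 340, 240, 0.5

inherited_depth = lift_2d_waypoint(u_px, v_px, depth_map)
print(f"2D waypoint, depth from lookup: {inherited_depth:.2f} m")
print(f"3D waypoint, depth as planned:  {planned_depth:.2f} m")
```

The lookup can only ever return the surface depth, so a 2D plan cannot express the hover pose; a planner that outputs depth directly can.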

This repository hosts the planner checkpoint — a single self-contained checkpoint (9B, bf16) that bundles the Qwen3-VL LLM, the vision encoder, the geometry merger, and the frozen LingBot-Depth geometry encoder weights.

Usage

This is a custom architecture (Qwen3VLGeometryForConditionalGeneration) and requires the hamster3d package (which vendors the geometry-encoder code). No separate LingBot-Depth download is needed — the encoder code is in the package and its weights are in this checkpoint.

# 1. Install the inference code
git clone https://github.com/DAVIAN-Robotics/3D_HAMSTER.git
cd 3D_HAMSTER && pip install -e .

# 2. Download this checkpoint into ./ckpt
hf download DAVIAN-Robotics/3D_HAMSTER --local-dir ckpt

# 3. Run inference (Python)
from hamster3d.inference import Hamster3DPredictor
import numpy as np
from PIL import Image

predictor = Hamster3DPredictor("ckpt/")          # device="cuda:0", bf16 by default

rgb = Image.open("examples/sample_0_rgb.png")
depth = np.load("examples/sample_0_depth.npy")   # float32, meters, shape (H, W)
instruction = open("examples/sample_0_instruction.txt").read().strip()

result = predictor.predict(rgb, depth, instruction)   # 3D trajectory prediction
print(result["waypoints"])   # [[u, v, depth], ...]  pixel u,v (0-1000) + metric depth (m)
print(result["actions"])     # ["Close Gripper", None, ..., "Open Gripper"]
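To feed the prediction into a point-cloud policy, the `[u, v, depth]` waypoints usually need to be expressed as camera-frame XYZ points. The following is a minimal sketch of that conversion under several assumptions that the source does not confirm: that `u` and `v` are normalized to 0-1000 over the original image width and height, that `depth` is z-depth along the optical axis, and that a pinhole camera model applies. The function name `waypoints_to_camera_xyz` and the intrinsics values are hypothetical.

```python
import numpy as np


def waypoints_to_camera_xyz(waypoints, image_size, intrinsics, scale=1000.0):
    """Back-project [u, v, depth] waypoints to camera-frame XYZ (meters).

    Assumptions (not confirmed by the model card): u, v are normalized to
    0..scale over the image width/height, and depth is z-depth in meters.
    """
    width, height = image_size          # same order as PIL's Image.size
    fx, fy, cx, cy = intrinsics         # pinhole intrinsics in pixels
    wp = np.asarray(waypoints, dtype=np.float64).reshape(-1, 3)
    u_px = wp[:, 0] / scale * width
    v_px = wp[:, 1] / scale * height
    z = wp[:, 2]
    x = (u_px - cx) * z / fx
    y = (v_px - cy) * z / fy
    return np.stack([x, y, z], axis=1)


# Hypothetical values standing in for result["waypoints"] / result["actions"].
waypoints = [[500, 500, 0.8], [750, 500, 0.8]]
actions = ["Close Gripper", None]
xyz = waypoints_to_camera_xyz(
    waypoints, image_size=(640, 480), intrinsics=(600.0, 600.0, 320.0, 240.0)
)

# Pair each 3D point with its gripper action (None means no gripper change).
for point, action in zip(xyz, actions):
    print(np.round(point, 3), action)
```

Use the intrinsics of the camera that produced the RGB-D frame; the values above are placeholders.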

Inputs: an RGB image (any resolution; automatically resized so that its longest edge is 640 px), a metric depth map (float32, meters, aligned to the RGB frame), and a language instruction. Output: a metric 3D end-effector trajectory, given as [u, v, depth] waypoints with one gripper action per waypoint.
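Many depth cameras deliver 16-bit integer depth in millimeters rather than float32 meters, so a small conversion step is often needed before calling the predictor. The helper below is a sketch, not part of the hamster3d package: the name `prepare_depth` and the default `depth_scale` of 0.001 (millimeters to meters) are assumptions that should be checked against your sensor.

```python
import numpy as np


def prepare_depth(raw_depth, rgb_size, depth_scale=0.001):
    """Return an (H, W) float32 depth map in meters matching the RGB frame.

    rgb_size is (width, height), the order used by PIL's Image.size.
    depth_scale converts integer sensor units to meters (assumed mm -> m).
    """
    depth = np.asarray(raw_depth)
    if depth.ndim != 2:
        raise ValueError(f"expected a 2D depth map, got shape {depth.shape}")
    width, height = rgb_size
    if depth.shape != (height, width):
        raise ValueError(
            f"depth shape {depth.shape} does not match RGB frame {(height, width)}"
        )
    if np.issubdtype(depth.dtype, np.integer):
        # Integer sensor units (e.g. millimeters) -> float32 meters.
        return depth.astype(np.float32) * np.float32(depth_scale)
    # Already floating point: assumed to be in meters, only cast the dtype.
    return depth.astype(np.float32)


# Example with a synthetic 16-bit frame holding 800 mm everywhere.
raw = np.full((480, 640), 800, dtype=np.uint16)
depth_m = prepare_depth(raw, rgb_size=(640, 480))
print(depth_m.dtype, depth_m.shape)
```

The shape check only verifies that the resolutions match; registering the depth image to the RGB camera still has to be done by the sensor driver or a separate alignment step.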

The Gradio demo in the code repo additionally supports 2D Trajectory, 2D/3D Pointing, 2D Bounding Box, and General VQA task styles.

Download tip: if hf download stalls, disable the xet backend by setting HF_HUB_DISABLE_XET=1 for the command: HF_HUB_DISABLE_XET=1 hf download DAVIAN-Robotics/3D_HAMSTER --local-dir ckpt

Model Details

Component	Details
Base VLM	Qwen3-VL-8B (Stage-1 pretrained)
Geometry encoder	LingBot-Depth — DINOv2 ViT-L/14, frozen (~306M params)
Fusion	`resize_and_add` (element-wise add after spatial alignment)
Training	LoRA (rank 64, α 128) on the LLM + fully trained merger/decoder, with a dense depth-reconstruction loss
Precision	bfloat16
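
The `resize_and_add` fusion listed in the table can be sketched in a few lines. This is an illustration of the general idea (bring the geometry feature grid to the visual token grid, then add element-wise), not the checkpoint's actual merger: the nearest-neighbor interpolation, the `(H, W, C)` layout, and the helper names are assumptions, and the real module may use a different interpolation or an extra projection.

```python
import numpy as np


def resize_nearest(features, out_hw):
    """Nearest-neighbor resize of an (H, W, C) grid to (out_h, out_w, C)."""
    in_h, in_w, _ = features.shape
    out_h, out_w = out_hw
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return features[rows[:, None], cols[None, :]]


def resize_and_add(visual, geometry):
    """Spatially align geometry features to the visual grid, then add."""
    if visual.shape[-1] != geometry.shape[-1]:
        raise ValueError("channel dimensions must match before adding")
    aligned = resize_nearest(geometry, visual.shape[:2])
    return visual + aligned


# Toy grids: a 4x4 visual grid and a 2x2 geometry grid, 8 channels each.
visual = np.zeros((4, 4, 8), dtype=np.float32)
geometry = np.arange(4, dtype=np.float32).reshape(2, 2, 1).repeat(8, axis=2)
fused = resize_and_add(visual, geometry)
print(fused.shape)
```

Because the fusion is an addition rather than a concatenation, the number of tokens passed to the LLM stays the same as for the RGB-only vision encoder.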

See the project page for benchmarks and qualitative results.

Acknowledgments & Licensing

Released under the Apache License 2.0. 3D HAMSTER builds on:

Qwen3-VL — base vision-language model.
LingBot-Depth (Robbyant, Apache-2.0) — the geometry encoder; its frozen weights are bundled in this checkpoint and its code is vendored in the hamster3d package.
DINOv2 (Meta AI, Apache-2.0) — backbone of the LingBot-Depth encoder.

All bundled components are Apache-2.0; their attributions are retained.

Citation

@INPROCEEDINGS{hwang20263dhamster,
  author={Hwang, Dongyoon and Lee, Byungkun and Kim, Dongjin and Jang, Hyojin and Jin, Hoiyeong and Mun, Jueun and Park, Minho and Lee, Hojoon and Kim, Hyunseung and Choo, Jaegul},
  booktitle={2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  title={{3D HAMSTER}: Bridging Planning and Control in Hierarchical Vision Language Action Models through {3D} Trajectory Guidance},
  year={2026}}
