GEM: Generative Supervision Helps Embodied Intelligence

Ruowen Zhao1, Bangguo Li1, Zuyan Liu1,2,†, Yinan Liang1, Junliang Ye1, Fangfu Liu1,
Diankun Wu1, Zhengyi Wang1, Xumin Yu2, Yongming Rao2,✉, Han Hu2, Jun Zhu1,✉
† Project Lead. ✉ Corresponding Author.
1Tsinghua University, 2Tencent Hunyuan


Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action (VLA) frameworks. However, a significant gap remains between the high-level semantic focus of standard text-guided pre-training paradigms and the low-level spatial and physical knowledge critical for execution in embodied environments. In this paper, we introduce GEM, a Generative-supervised Embodied vision-language Model designed to bridge this gap. We propose integrating a depth map generation task directly into the VLM pre-training phase. By training this generative objective jointly with the main model, we observe substantial improvements in both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM-4M, a large-scale dataset that mixes grounding, reasoning, and planning data paired with high-quality depth supervision. Extensive experiments demonstrate that GEM achieves state-of-the-art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM-VLA, exhibits superior task execution in both simulation environments and real-world evaluations.
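To make the idea of joint training concrete, here is a minimal sketch of how a text objective and a depth objective could be combined into one loss. It is an illustration only and not the GEM training code: the abstract does not specify the form of the generative objective, so the L1 depth term, the function names, and the `depth_weight` value are all assumptions made for this example.

```python
import math


def text_cross_entropy(logits, target_ids):
    """Mean next-token cross-entropy.

    logits: list of per-token score lists (one list per position).
    target_ids: list of the correct token index at each position.
    """
    total = 0.0
    for scores, target in zip(logits, target_ids):
        # Numerically stable log-sum-exp over the vocabulary.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
    return total / len(target_ids)


def depth_l1(pred_depth, target_depth):
    """Mean absolute error between flattened depth maps.

    Assumption: L1 regression stands in for the depth generation
    objective, whose actual form is not given in the abstract.
    """
    diffs = (abs(p - t) for p, t in zip(pred_depth, target_depth))
    return sum(diffs) / len(target_depth)


def joint_loss(logits, target_ids, pred_depth, target_depth, depth_weight=0.5):
    """Weighted sum of the text loss and the depth loss.

    depth_weight is a hypothetical hyperparameter, not a value
    reported by the authors.
    """
    text_term = text_cross_entropy(logits, target_ids)
    depth_term = depth_l1(pred_depth, target_depth)
    return text_term + depth_weight * depth_term


# Toy usage: one token over a two-word vocabulary, a two-pixel depth map.
loss = joint_loss(
    logits=[[0.0, 0.0]],
    target_ids=[0],
    pred_depth=[1.0, 2.0],
    target_depth=[1.0, 3.0],
)
```

In a sketch like this, both terms are computed from the same forward pass, so gradients from the depth term also shape the shared representation used for text prediction.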
