
	---
	base_model:
	- Qwen/Qwen3-VL-8B-Instruct
	language:
	- en
	- zh
	library_name: transformers
	license: apache-2.0
	pipeline_tag: robotics
	tags:
	- embodied-ai
	- vision-language-model
	- embodied-reasoning
	- spatial-reasoning
	- pointing
	- vla
	- qwen3-vl
	---

	# Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

	<p align="center">
	🌐 <a href="https://embodied-r.github.io/">Project Page</a>  |
	💻 <a href="https://github.com/pickxiguapi/Embodied-R1.5">Code</a>  |
	🧰 <a href="https://github.com/pickxiguapi/EmbodiedEvalKit">EmbodiedEvalKit</a>  |
	🤗 <a href="https://huggingface.co/collections/IffYuan/embodied-r15">Models & Datasets</a>  |
	📄 <a href="https://huggingface.co/papers/2606.11324">Paper</a>
	</p>

	> "Reasoning initiates the action; Action fulfills the reasoning." — Wang Yangming (1509)

	## Overview

	Embodied-R1.5 is a unified Embodied Foundation Model (EFM), built on Qwen3-VL-8B-Instruct, that integrates comprehensive embodied reasoning within a single architecture. Building on [Embodied-R1](https://github.com/pickxiguapi/Embodied-R1), it leaps from a pointing specialist to a comprehensive EFM unifying three core capabilities:

	- Spatial cognition & reasoning — comprehend the semantic and spatial structure of the physical world, including static geometric relations and dynamic interaction possibilities.
	- Task planning & correction — cover the full task life cycle: long-horizon decomposition, next-step planning, process detection, error localization, and correction.
	- Embodied pointing & location — ground high-level reasoning in coordinates and trajectories, covering referring expression grounding, region-level localization, functional (affordance) grounding, and visual trace generation.

	Embodied-R1.5 is trained on a 15B-token corpus with a multi-task balanced RL recipe. It also drives a Planner-Grounder-Corrector (PGC) closed-loop framework, in which a single model acts as planner, grounder, and corrector to complete long-horizon real-world tasks autonomously.
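
The closed-loop idea can be illustrated with a small control loop. This is only a sketch under stated assumptions, not the released PGC implementation: `run_pgc` and the four callables (`plan`, `ground`, `execute`, `check`) are hypothetical names, and in practice `plan`, `ground`, and `check` would each be a differently prompted call to the same model.

```python
from typing import Any, Callable, List, Optional, Tuple


def run_pgc(
    task: str,
    plan: Callable[[str, List[Tuple[str, bool]]], Optional[str]],
    ground: Callable[[str], Any],
    execute: Callable[[str, Any], None],
    check: Callable[[str], bool],
    max_steps: int = 20,
) -> List[Tuple[str, bool]]:
    """Hypothetical plan -> ground -> act -> verify loop (illustrative only)."""
    history: List[Tuple[str, bool]] = []
    for _ in range(max_steps):
        # Planner: choose the next sub-task given the task and past outcomes.
        step = plan(task, history)
        if step is None:
            break  # the planner reports that nothing is left to do
        # Grounder: turn the sub-task into coordinates or a trajectory.
        target = ground(step)
        execute(step, target)
        # Corrector: record success or failure so the planner can re-plan.
        history.append((step, check(step)))
    return history
```

Because failed steps stay in `history`, the planner sees them on the next iteration and can retry or choose a different step, which is what makes the loop closed.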

	## Output Conventions

	Embodied-R1.5 follows the Qwen3-VL chat format and outputs structured answers inside `<answer>...</answer>` tags. The supported task types and their answer formats are:

	| Task Type | Answer Format (inside `<answer>`) |
	|-----------|-----------------------------------|
	| `multiple choice` | `A` |
	| `numerical` | `3.14` |
	| `open-ended` | free text |
	| `math` | `$$-\dfrac{3}{2}$$` |
	| `spatial grounding` | `{"boxes": [35, 227, 437, 932]}` |
	| `point` | `[{"point_2d": [230, 138]}]` in a `json` code fence |
	| `trace` | `[{"point_2d": [624, 469]}, ...]` in a `json` code fence |
	| `trace_3d` | `[{"point_2d": [463, 599], "depth": 1.08}, ...]` in a `json` code fence |

	> Coordinate & unit conventions. All points (`point_2d`) and boxes are normalized to the `[0, 1000]` range, regardless of the original image resolution. For `trace_3d`, the `depth` value is in meters.
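
Since the outputs are normalized, they usually need to be mapped back to pixels before use. The helpers below are a minimal sketch of that conversion; the function names are hypothetical, and the code assumes `point_2d` is ordered `[x, y]` and that 1000 maps to the full image width or height.

```python
from typing import Any, Dict, List, Sequence, Tuple

NORM_RANGE = 1000  # model coordinates are normalized to [0, 1000]


def to_pixels(point_2d: Sequence[float], width: int, height: int) -> Tuple[float, float]:
    """Map a normalized point to pixel coordinates (assumes [x, y] order)."""
    x, y = point_2d
    return (x / NORM_RANGE * width, y / NORM_RANGE * height)


def trace_to_pixels(trace: List[Dict[str, Any]], width: int, height: int) -> List[Dict[str, Any]]:
    """Convert a `point`, `trace`, or `trace_3d` answer to pixel coordinates."""
    converted = []
    for waypoint in trace:
        item: Dict[str, Any] = {"pixel": to_pixels(waypoint["point_2d"], width, height)}
        if "depth" in waypoint:
            item["depth_m"] = waypoint["depth"]  # depth is already in meters
        converted.append(item)
    return converted
```

For a 640x480 image, `to_pixels([500, 250], 640, 480)` gives `(320.0, 120.0)`.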

	## Quick Start

	```python
	from transformers import AutoModelForImageTextToText, AutoProcessor
	from PIL import Image

	model_id = "IffYuan/Embodied-R1.5"
	# Load the weights in their stored dtype and place them on the available device(s).
	model = AutoModelForImageTextToText.from_pretrained(
	    model_id, torch_dtype="auto", device_map="auto"
	)
	processor = AutoProcessor.from_pretrained(model_id)

	image = Image.open("scene.jpg")
	# One user turn: an image placeholder followed by the task instruction.
	messages = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "image"},
	            {"type": "text", "text": "You are a robot performing manipulation tasks. "
	             "The task instruction is: move the blue cube on top of the yellow cube. "
	             "Use 2D points to mark the target location."},
	        ],
	    }
	]

	# Render the chat template to text, then tokenize it together with the image.
	text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
	out = model.generate(**inputs, max_new_tokens=512)
	print(processor.batch_decode(out, skip_special_tokens=True)[0])
	```

	The model reasons over the visual observation and emits its final decision within an `<answer>` tag, e.g. `<answer>[{"point_2d": [750, 748]}]</answer>`.
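
To use the result programmatically, the answer has to be pulled out of the decoded text. The following is a minimal sketch of one way to do that; `extract_answer` and `parse_points` are hypothetical helpers, and the code assumes the payload is valid JSON that may or may not be wrapped in a `json` code fence.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this block readable
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
FENCE_RE = re.compile("^" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + "$", re.DOTALL)


def extract_answer(text: str) -> Optional[str]:
    """Return the content of the last <answer>...</answer> pair, or None."""
    matches = ANSWER_RE.findall(text)
    return matches[-1].strip() if matches else None


def parse_points(text: str) -> Any:
    """Parse a point or trace answer into Python objects; [] if no answer tag is found."""
    raw = extract_answer(text)
    if raw is None:
        return []
    fenced = FENCE_RE.match(raw)
    if fenced:
        raw = fenced.group(1)  # drop the surrounding code fence, if present
    return json.loads(raw)
```

Taking the last match is a deliberate choice: the decoded string can contain earlier text, and the final decision is expected at the end of the generation.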

	## Inference with vLLM (Recommended)

	For higher throughput, serve the model with vLLM:

	```bash
	vllm serve IffYuan/Embodied-R1.5 \
	  --served-model-name "Embodied-R1.5" \
	  --tensor-parallel-size 1 \
	  --mm-encoder-tp-mode data \
	  --gpu-memory-utilization 0.7 \
	  --async-scheduling \
	  --media-io-kwargs '{"video": {"num_frames": 32}, "image": {"max_num": 32}}' \
	  --max-model-len 20000 \
	  --limit-mm-per-prompt '{"image": 8, "video": 1}' \
	  --host 0.0.0.0 --port 22002
	```
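
Once the server is running, it can be queried through vLLM's OpenAI-compatible `/v1/chat/completions` endpoint. The client below is a sketch using only the standard library; `build_request` and `query` are hypothetical helper names, and the default URL assumes the server is reachable on `localhost` at the port used above.

```python
import base64
import json
import urllib.request
from typing import Any, Dict


def build_request(image_bytes: bytes, instruction: str,
                  model: str = "Embodied-R1.5", mime: str = "image/jpeg") -> Dict[str, Any]:
    """Build a chat-completions payload with one inline (base64) image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # must match --served-model-name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
                {"type": "text", "text": instruction},
            ],
        }],
        "max_tokens": 512,
    }


def query(payload: Dict[str, Any], base_url: str = "http://localhost:22002/v1") -> str:
    """POST the payload and return the generated text (needs a running server)."""
    request = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

A typical call would read the image from disk, for example `query(build_request(open("scene.jpg", "rb").read(), "Use 2D points to mark the target location."))`.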

	## Evaluation

	For benchmark evaluation, see [EmbodiedEvalKit](https://github.com/pickxiguapi/EmbodiedEvalKit), an evaluation framework covering 25+ embodied benchmarks.

	## Training & Data

	Embodied-R1.5 is trained in two stages: SFT (LLaMA-Factory) followed by RFT (EasyR1). Full training scripts are available in the [GitHub repository](https://github.com/pickxiguapi/Embodied-R1.5). Datasets are released in the [Embodied-R1.5 Hugging Face collection](https://huggingface.co/collections/IffYuan/embodied-r15).

	## Citation

	If you find Embodied-R1.5 useful in your research, please cite:

	```bibtex
	@article{yuan2026embodiedr15,
	  title={Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models},
	  author={Yuan, Yifu and Huang, Yaoting and Yao, Xianze and Li, Yutong and Zhang, Shuoheng and Han, Linqi and Li, Pengyi and Sun, Jiangeng and Jia, Wenting and Zhao Zhang and Liu, Yuhao and Liao, Ruihao and Hu, Yucheng and Wu, Qiyu and Li, Yuxiao and Dong, Zibin and Ni, Fei and Zheng, Yan and Gu, Shuyang and Ma, Yi and Tang, Hongyao and Hu, Han and Hao, Jianye},
	  journal={arXiv preprint},
	  year={2026}
	}

	@article{yuan2025embodied,
	  title={Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation},
	  author={Yuan, Yifu and Cui, Haiqin and Huang, Yaoting and Chen, Yibin and Ni, Fei and Dong, Zibin and Li, Pengyi and Zheng, Yan and Hao, Jianye},
	  journal={ICLR 2026},
	  year={2025}
	}
	```

	## License

	Released under the Apache 2.0 license.