Instructions for using IffYuan/Embodied-R1.5 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use IffYuan/Embodied-R1.5 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="IffYuan/Embodied-R1.5")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("IffYuan/Embodied-R1.5")
model = AutoModelForMultimodalLM.from_pretrained("IffYuan/Embodied-R1.5")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use IffYuan/Embodied-R1.5 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "IffYuan/Embodied-R1.5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "IffYuan/Embodied-R1.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/IffYuan/Embodied-R1.5

SGLang

How to use IffYuan/Embodied-R1.5 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "IffYuan/Embodied-R1.5" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "IffYuan/Embodied-R1.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "IffYuan/Embodied-R1.5" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "IffYuan/Embodied-R1.5",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

🌐 Project Page | 💻 Code | 🧰 EmbodiedEvalKit | 🤗 Models & Datasets

"Reasoning initiates the action; Action fulfills the reasoning." — Wang Yangming (1509)

Overview

Embodied-R1.5 is a unified Embodied Foundation Model (EFM), built on Qwen3-VL-8B-Instruct, that integrates comprehensive embodied reasoning within a single architecture. Building on our prior work Embodied-R1, it leaps from a pointing specialist to a comprehensive EFM unifying three core capabilities:

Spatial cognition & reasoning — comprehend the semantic and spatial structure of the physical world, including static geometric relations and dynamic interaction possibilities.
Task planning & correction — cover the full task life cycle: long-horizon decomposition, next-step planning, process detection, error localization, and correction.
Embodied pointing & location — ground high-level reasoning in coordinates and trajectories, covering referring expression grounding, region-level localization, functional (affordance) grounding, and visual trace generation.

Embodied-R1.5 is trained on a 15B-token corpus with a multi-task balanced RL recipe. It also drives a Planner-Grounder-Corrector (PGC) closed-loop framework, in which a single model acts as planner, grounder, and corrector to autonomously complete long-horizon real-world tasks. With only 8B parameters, Embodied-R1.5 achieves the best result on 16 of 24 embodied VLM benchmarks (avg. 70.4%), surpassing Gemini-Robotics-ER-1.5 and GPT-5.4. With light action-data fine-tuning, it adapts into Embodied-R1.5-VLA, which outperforms strong baselines such as $\pi_{0.5}$ across 4 manipulation benchmark suites. It also generalizes zero-shot to real robots on instruction following, affordance grounding, articulated manipulation, and long-horizon tasks.
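
The closed loop described above can be pictured with a minimal sketch. This is not the authors' implementation: `run_pgc`, the four callables, the retry policy, and the `FAILED:` marker are all hypothetical names and choices made for illustration, and visual observations are left out for brevity. In practice the planner, grounder, and corrector would each wrap a prompt to the same Embodied-R1.5 model.

```python
from typing import Callable, List

# All four callables are hypothetical stand-ins, not part of any released API.
Planner = Callable[[str, List[str]], str]  # (task, history) -> next step, "" when done
Grounder = Callable[[str], dict]           # step -> grounded target (points, trace, ...)
Corrector = Callable[[str], bool]          # step -> True if the step succeeded
Executor = Callable[[dict], None]          # sends the grounded target to the robot


def run_pgc(task: str, plan: Planner, ground: Grounder, check: Corrector,
            execute: Executor, max_steps: int = 20, max_retries: int = 2) -> List[str]:
    """Plan a step, ground it, execute it, and retry when the check fails."""
    history: List[str] = []
    for _ in range(max_steps):
        step = plan(task, history)
        if not step:  # planner signals that the task is complete
            break
        for _attempt in range(max_retries + 1):
            execute(ground(step))
            if check(step):
                history.append(step)
                break
        else:
            # Retries exhausted: record the failure so the planner can replan.
            history.append(f"FAILED: {step}")
    return history
```

Passing the history back to the planner is one simple way to let it react to failed steps; a real system would also feed fresh camera observations into every call.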

Output Conventions

Embodied-R1.5 follows the Qwen3-VL chat format and outputs structured answers inside <answer>...</answer> tags. The supported task types and their answer formats are:

| Task Type | Answer Format (inside `<answer>`) |
|---|---|
| `multiple choice` | `A` |
| `numerical` | `3.14` |
| `open-ended` | free text |
| `math` | `$$-\dfrac{3}{2}$$` |
| `spatial grounding` | `{"boxes": [35, 227, 437, 932]}` |
| `point` | ```json\n[{"point_2d": [230, 138]}]\n``` |
| `trace` | ```json\n[{"point_2d": [624, 469]}, ...]\n``` |
| `trace_3d` | ```json\n[{"point_2d": [463, 599], "depth": 1.08}, ...]\n``` |

Coordinate & unit conventions. All points (point_2d) and boxes are normalized to the [0, 1000] range, regardless of the original image resolution. For trace_3d, the depth value is in meters.
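
As a minimal sketch of this convention, the helpers below map normalized coordinates back to pixels. The function names are hypothetical. The sketch also assumes that points are ordered [x, y] and boxes [x1, y1, x2, y2]; neither ordering is stated in this model card, so check it against the examples in the repository before relying on it.

```python
from typing import List, Sequence, Tuple

NORM_RANGE = 1000.0  # points and boxes are normalized to [0, 1000]


def point_to_pixels(point_2d: Sequence[float], width: int, height: int) -> Tuple[float, float]:
    """Scale a normalized point to pixel coordinates (assumes [x, y] order)."""
    x, y = point_2d
    return x / NORM_RANGE * width, y / NORM_RANGE * height


def box_to_pixels(box: Sequence[float], width: int, height: int) -> List[float]:
    """Scale a normalized box to pixels (assumes [x1, y1, x2, y2] order)."""
    x1, y1, x2, y2 = box
    return [x1 / NORM_RANGE * width, y1 / NORM_RANGE * height,
            x2 / NORM_RANGE * width, y2 / NORM_RANGE * height]


if __name__ == "__main__":
    # Example values from the answer-format table, on a hypothetical 640x480 image.
    print(point_to_pixels([230, 138], 640, 480))
    print(box_to_pixels([35, 227, 437, 932], 640, 480))
```

The `depth` field of a `trace_3d` entry is already in meters, so it needs no rescaling.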

Quick Start

from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "IffYuan/Embodied-R1.5"
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("scene.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "You are a robot performing manipulation tasks. "
                                     "The task instruction is: move the blue cube on top of the yellow cube. "
                                     "Use 2D points to mark the target location."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
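out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens so the prompt is not echoed back
generated = out[:, inputs["input_ids"].shape[-1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])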

The model reasons over the visual observation and emits its final decision within an <answer> tag, e.g. <answer>[{"point_2d": [750, 748]}]</answer>.
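
One possible way to pull the final answer out of a response is sketched below. The helper names are hypothetical and the sketch rests on two assumptions: the answer of interest is the last `<answer>` pair in the text, and a JSON payload may or may not be wrapped in a json code fence. Answers that are not valid JSON (a multiple choice letter or free text) are returned as plain strings.

```python
import json
import re
from typing import Any, Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
FENCE = "`" * 3  # markdown code-fence marker, built here to keep this listing readable
FENCE_RE = re.compile(rf"^{FENCE}(?:json)?\s*|\s*{FENCE}$")


def extract_answer(response: str) -> Optional[str]:
    """Return the text inside the last <answer>...</answer> pair, or None."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].strip() if matches else None


def parse_structured_answer(response: str) -> Any:
    """Parse a JSON answer (point, trace, boxes); fall back to the raw string."""
    answer = extract_answer(response)
    if answer is None:
        return None
    # Strip an optional json code-fence wrapper around the payload.
    payload = FENCE_RE.sub("", answer).strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return answer  # e.g. multiple choice "A" or free text
```

Note that a numerical answer such as `3.14` is valid JSON, so this sketch returns it as a float rather than a string.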

Inference with vLLM (Recommended)

For higher throughput, serve the model with vLLM:

vllm serve IffYuan/Embodied-R1.5 \
  --served-model-name "Embodied-R1.5" \
  --tensor-parallel-size 1 \
  --mm-encoder-tp-mode data \
  --gpu-memory-utilization 0.7 \
  --async-scheduling \
  --media-io-kwargs '{"video": {"num_frames": 32}, "image": {"max_num": 32}}' \
  --max-model-len 20000 \
  --limit-mm-per-prompt '{"image": 8, "video": 1}' \
  --host 0.0.0.0 --port 22002

More runnable examples (vLLM online / offline, HuggingFace, point decoding & visualization) are provided in the GitHub repository under inference/.
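
For orientation, a client for the server started above might look like the following sketch, which uses only the Python standard library. It assumes the server is reachable on localhost at port 22002 under the served model name `Embodied-R1.5`, as in the command above. The helper names, the `max_tokens` value, and the choice to inline the image as a base64 data URL are illustrative assumptions, not taken from the repository.

```python
import base64
import json
import urllib.request
from typing import Any, Dict

# Port and served model name mirror the `vllm serve` command; adjust as needed.
BASE_URL = "http://localhost:22002/v1"
MODEL_NAME = "Embodied-R1.5"


def build_request(image_bytes: bytes, prompt: str, mime: str = "image/jpeg") -> Dict[str, Any]:
    """Build an OpenAI-style chat payload with the image inlined as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": MODEL_NAME,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 512,
    }


def chat(payload: Dict[str, Any], base_url: str = BASE_URL) -> str:
    """POST the payload to /chat/completions and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat(build_request(open("scene.jpg", "rb").read(), "Use 2D points to mark the target location."))` would return the model's reply as a string.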

Evaluation

For benchmark evaluation, see EmbodiedEvalKit, our evaluation framework covering 25+ embodied benchmarks.

Training & Data

Embodied-R1.5 is trained in two stages: SFT (LLaMA-Factory) followed by RFT (EasyR1). Full training scripts are available in the GitHub repository. Datasets are released in the Embodied-R1.5 HuggingFace collection.

Citation

If you find Embodied-R1.5 useful in your research, please cite our work:

@article{yuan2026embodiedr15,
  title={Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models},
  author={Yuan, Yifu and Huang, Yaoting and Yao, Xianze and Zhang, Shuoheng and Han, Linqi and Li, Yutong and Li, Pengyi and Sun, Jiangeng and Jia, Wenting and Hu, Yucheng and Liu, Yuhao and Liao, Ruihao and Wu, Qiyu and Li, Yuxiao and Zhang, Zhao and Dong, Zibin and Ni, Fei and Zheng, Yan and Gu, Shuyang and Ma, Yi and Tang, Hongyao and Hu, Han and Hao, Jianye},
  journal={arXiv preprint},
  year={2026}
}

@article{yuan2025embodied,
  title={Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation},
  author={Yuan, Yifu and Cui, Haiqin and Huang, Yaoting and Chen, Yibin and Ni, Fei and Dong, Zibin and Li, Pengyi and Zheng, Yan and Hao, Jianye},
  journal={ICLR 2026},
  year={2025}
}

@article{yuan2025seeing,
  title={From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation},
  author={Yuan, Yifu and Cui, Haiqin and Chen, Yibin and Dong, Zibin and Ni, Fei and Kou, Longxin and Liu, Jinyi and Li, Pengyi and Zheng, Yan and Hao, Jianye},
  journal={ICLR 2026},
  year={2025}
}

License

Released under the Apache 2.0 license.

Model size: 9B params (Safetensors, tensor type BF16)

Base model: Qwen/Qwen3-VL-8B-Instruct
