arxiv:2606.23049

Training Open Models for Agentic Phone Use

Published on Jun 22

Submitted by taesiri on Jun 23

Tencent Hunyuan


Authors: Zhengyang Tang, Junyi Li

Abstract

PhoneBuddy combines real and mock app environments to improve training of open models for phone use, demonstrating enhanced task success rates through mixed reinforcement learning approaches.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Phones are becoming an important execution surface for general-purpose agents, but training open models for reliable phone use remains difficult because the environment that matters at deployment, real devices running real apps, is slow, stateful, side-effectful, and hard to reset or verify, while scalable mock environments only approximate real behavior. We present PhoneBuddy, a training recipe and open-model line for agentic phone use that combines a real-app environment with a mock-app environment, PhoneWorld, which reconstructs runnable mock apps from real GUI usage structure. PhoneBuddy first builds a shared supervised fine-tuning stage from trajectories collected in both environments, then compares real-app RL against mixed RL across both environments. Across a 150-task human evaluation on real phones spanning apps, mini-apps, and cross-app workflows, task success rate improves from 36.67% after supervised fine-tuning to 40.67% after real-app RL and 45.33% after mixed RL. On AndroidWorld, the same progression rises from 60.3% to 77.2% to 83.2%. These results show that mock-app training is not a replacement for real-app RL, but a complementary source of scalable, resettable, and automatically checked interaction. The gains are strongest on app and mini-app tasks, while long-horizon cross-app workflows remain an important open challenge.
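To make the reported progression easier to compare, the sketch below recomputes the absolute gains over the supervised fine-tuning baseline from the percentages quoted in the abstract. It also back-calculates how many of the 150 human-evaluation tasks each rate would correspond to; those task counts are an inference from the rounded percentages, not figures stated in the abstract, and the function names are illustrative only.

```python
# Success rates (percent) as quoted in the abstract, keyed by training stage.
HUMAN_EVAL_TASKS = 150  # size of the real-phone human evaluation

human_eval = {"SFT": 36.67, "real-app RL": 40.67, "mixed RL": 45.33}
android_world = {"SFT": 60.3, "real-app RL": 77.2, "mixed RL": 83.2}


def implied_successes(rate_pct, n_tasks):
    """Nearest whole number of solved tasks consistent with a reported rate.

    Assumption: the reported percentage is a rounded ratio over n_tasks.
    """
    return round(rate_pct / 100 * n_tasks)


def gains_over_sft(rates):
    """Absolute gain, in percentage points, of each later stage over SFT."""
    base = rates["SFT"]
    return {
        stage: round(rate - base, 2)
        for stage, rate in rates.items()
        if stage != "SFT"
    }


counts = {
    stage: implied_successes(rate, HUMAN_EVAL_TASKS)
    for stage, rate in human_eval.items()
}
human_gains = gains_over_sft(human_eval)
android_gains = gains_over_sft(android_world)

print("Implied solved tasks (inferred):", counts)
print("Human-eval gains over SFT (pp):", human_gains)
print("AndroidWorld gains over SFT (pp):", android_gains)
```

Under these assumptions, mixed RL adds roughly twice the gain of real-app RL alone on the human evaluation, which is consistent with the abstract's framing of mock-app training as a complement to real-app RL.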


Community

noahml

about 7 hours ago

This is a really interesting approach to the phone agent problem. Using a mix of real and mock environments to bridge that gap between simulation speed and real-world reliability makes a lot of sense, especially since resetting real apps is such a headache.

I'm curious if you have any thoughts on why cross-app workflows are still lagging behind. Do you think the bottleneck is more about the model's long-term memory or the complexity of moving between distinct app interfaces?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/dd8dcb05-a1cd-43ec-a37f-f2a03b2509ac


Get this paper in your agent:

hf papers read 2606.23049

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
