KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

Tongbo Chen1,*, Zhengxi Lu1,*, Zhan Xu1,*, Guocheng Shao1,*, Shaohan Zhao1,*, Fei Tang1, Yong Du1, Kaitao Song2, Yizhou Liu1, Yuchen Yan1, Wenqi Zhang1, Xu Tan3, Weiming Lu1, Jun Xiao1, Yueting Zhuang1, Yongliang Shen1,†
1Zhejiang University 2Apple 3Tencent
*Equal Contribution †Corresponding author

Leaderboard

Success rate (%) on KnowU-Bench across 42 general, 86 personalized, and 64 proactive tasks. All results use max steps = 50. Higher is better. Please contact us to submit your results.

Rank Category Model Overall General Personalized Proactive

Overview

Personalized mobile agents that infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants, yet existing benchmarks fail to evaluate what this actually requires. Prior work evaluates preference recovery from static histories or intent prediction from fixed contexts — neither tests whether an agent can elicit missing preferences through interaction, nor whether it can decide when to intervene, seek consent, or remain silent in a live GUI environment.

KnowU-Bench is an online, interactive benchmark for personalized mobile agents built on a reproducible Android emulation environment, covering 192 tasks across 23 apps: 42 general, 86 personalized, and 64 proactive. Unlike prior work that treats user preferences as static context, KnowU-Bench hides the user profile from the agent and exposes only behavioral logs, forcing genuine preference inference rather than context lookup. An LLM-driven user simulator grounded in structured profiles supports multi-turn clarification and proactive consent handling. Evaluation covers the complete proactive decision chain through a hybrid protocol combining programmatic verification with LLM-as-Judge scoring.
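The hidden-profile design can be sketched as follows. The field names, log format, and observation interface below are hypothetical illustrations of the idea, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    # Ground-truth preferences: visible to the user simulator and the
    # evaluator, but never exposed to the mobile agent.
    preferences: dict

@dataclass
class Task:
    instruction: str        # possibly vague, e.g. "order my usual coffee"
    behavior_logs: list     # past app interactions the agent may read
    profile: UserProfile = field(default=None, repr=False)  # hidden ground truth

    def agent_observation(self):
        # The agent sees only the instruction and raw behavioral logs;
        # preferences must be inferred from logs or elicited in dialogue.
        return {"instruction": self.instruction, "logs": self.behavior_logs}

task = Task(
    instruction="Order my usual coffee",
    behavior_logs=["2024-05-01: ordered oat-milk latte",
                   "2024-05-03: ordered oat-milk latte"],
    profile=UserProfile(preferences={"drink": "latte", "milk": "oat"}),
)
observation = task.agent_observation()
```

The key design choice is that the ground-truth profile lives outside the agent's observation, so matching the user's preferences requires genuine inference over the logs rather than a context lookup.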

Our evaluation reveals that agents excelling at explicit GUI execution degrade substantially once success depends on knowing the user or deciding whether to act at all. The core bottlenecks are not GUI navigation but preference acquisition and intervention calibration — exposing a fundamental gap between competent interface operation and trustworthy personal assistance.

Evaluation Pipeline

KnowU-Bench orchestrates evaluation through a multi-stage pipeline. Each task is dispatched to a worker that manages parallelism and assigns it to a mobile agent running on Android emulators (AVDs). During execution, the agent interacts with a profile-grounded user agent to resolve ambiguous preferences through clarifying dialogue. Finally, a hybrid evaluator combines rule-based state checks with LLM-as-Judge semantic scoring to produce the final result.
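The control flow above can be condensed into a minimal, self-contained sketch. Every class, field, and scoring rule here is an illustrative stand-in, not the benchmark's actual API:

```python
# Toy stand-in for an Android AVD: a tiny key-value app state.
class Emulator:
    def __init__(self):
        self.state = {}

    def execute(self, action):
        self.state[action["key"]] = action["value"]

def evaluate_task(task, agent_policy, user_agent, judge, max_steps=50):
    env = Emulator()
    for _ in range(max_steps):
        action = agent_policy(env.state, task)
        if action["kind"] == "ask_user":            # multi-turn clarification
            task["dialogue"].append(user_agent(action["question"]))
        elif action["kind"] == "done":
            break
        else:
            env.execute(action)
    # Hybrid evaluation: rule-based state checks gated by a judge score.
    rule_ok = all(env.state.get(k) == v for k, v in task["goal_state"].items())
    return rule_ok and judge(env.state) >= task["threshold"]

task = {
    "instruction": "order my usual coffee",
    "goal_state": {"milk": "oat"},
    "threshold": 0.5,
    "dialogue": [],
}

def agent_policy(state, task):
    if not task["dialogue"]:                        # preference unknown: ask first
        return {"kind": "ask_user", "question": "Which milk do you prefer?"}
    if "milk" not in state:
        return {"kind": "set", "key": "milk", "value": task["dialogue"][-1]}
    return {"kind": "done"}

user_agent = lambda question: "oat"                 # profile-grounded reply (stub)
judge = lambda state: 1.0                           # LLM-as-Judge stand-in
result = evaluate_task(task, agent_policy, user_agent, judge)
```

In this toy run the agent asks one clarifying question, applies the answer, and passes both the programmatic check and the judge threshold.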

Figure: End-to-end evaluation pipeline of KnowU-Bench.

Benchmark Comparison

Most prior benchmarks focus on either GUI execution or offline personalization. KnowU-Bench combines both in a live, reproducible mobile environment.

Among the compared benchmarks, KnowU-Bench is the only one that jointly covers vague instructions, proactive execution, user simulation, user logs, and user modeling for personalized and proactive GUI execution.

Figure: KnowU-Bench is the only benchmark in this comparison that covers all five capability dimensions.

Task Categories

KnowU-Bench contains 192 tasks across three categories. The donut chart summarizes the benchmark composition, while the cards explain what each split is designed to test.

Figure: Benchmark composition. 192 total tasks: 42 General · 86 Personalized · 64 Proactive.

Key Insights

We summarize the main experimental takeaways as a set of research-question-style findings, grounded in the results reported in the paper.

RQ1

How much harder are personalized and proactive tasks?

  • General execution is no longer the bottleneck. MAI-UI-8B and Seed 2.0 Pro both reach 100.0% on general tasks, indicating that following explicit instructions is largely solved for strong models.
  • Performance drops sharply once agents must infer preferences or calibrate proactive initiative. Claude Sonnet 4.6 achieves the best overall success rate at 60.4%, while all open-source models remain below 12% on the hard personalized split.
  • The average score on personalized tasks is consistently much higher than strict success rate, indicating that agents can partially align with user preferences yet still fail to compose them into fully correct end-to-end behavior.
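The gap between strict success rate and average score can be made concrete with a small sketch. The constraint sets below are hypothetical, chosen only to illustrate partial-credit scoring:

```python
# Illustrative contrast between strict success rate and average score on
# personalized tasks; the preference constraints below are hypothetical.
def strict_success(satisfied, required):
    return satisfied == required                  # every preference composed correctly

def partial_score(satisfied, required):
    return len(satisfied & required) / len(required)

required = {"oat milk", "decaf", "large"}
runs = [{"oat milk", "decaf"},                    # partially aligned
        {"oat milk"},                             # partially aligned
        {"oat milk", "decaf", "large"}]           # fully correct

success_rate = sum(strict_success(s, required) for s in runs) / len(runs)
average_score = sum(partial_score(s, required) for s in runs) / len(runs)
# success_rate = 1/3 while average_score = 2/3: partial alignment without
# full end-to-end success
```

An agent can satisfy most individual preferences in every run yet rarely compose all of them, which is exactly the pattern the paper observes.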

RQ2

Does asking more questions improve personalization?

  • Asking more is not the same as asking better. The real bottleneck is converting user feedback into correct downstream actions, not the quantity of clarification queries.
  • Claude Sonnet 4.6 reaches 44.2% success rate and 78.9% average score while asking only 0.4 questions per task on average — the most efficient profile among all evaluated models.
  • Seed 2.0 Pro asks roughly twice as many questions yet still trails behind. Similarly, two Qwen models with near-identical query rates achieve noticeably different scores, confirming that interaction quality dominates over quantity.

RQ3

What makes proactive service difficult?

  • Proactivity is a calibration problem, not an execution problem. A deployable assistant must know when to act, when to ask, and when to stay silent — and no single model excels at all three.
  • Claude Sonnet 4.6 is the most balanced, achieving the best Act score at 70.8% while remaining competitive on Silent and Stop.
  • Qwen3.5-397B-A17B leads on Silent at 73.7% and Stop at 75.0%, but falls to 31.8% on Act — showing that restraint and initiative are not jointly optimized by current models.
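Calibration across the three regimes can be scored per regime, as in the sketch below. The labels ("act", "silent", "stop") and the toy episodes are illustrative only, not the paper's protocol:

```python
# Sketch of per-regime calibration scoring for proactive tasks.
from collections import defaultdict

def regime_accuracy(episodes):
    """episodes: (gold_regime, agent_decision) pairs; a well-calibrated
    agent matches the gold regime in every situation."""
    correct, total = defaultdict(int), defaultdict(int)
    for gold, decision in episodes:
        total[gold] += 1
        correct[gold] += int(gold == decision)
    return {regime: correct[regime] / total[regime] for regime in total}

episodes = [("act", "act"), ("act", "silent"),    # one warranted intervention missed
            ("silent", "act"),                    # one unwarranted intervention
            ("silent", "silent"),
            ("stop", "stop")]
scores = regime_accuracy(episodes)
# scores == {"act": 0.5, "silent": 0.5, "stop": 1.0}
```

Reporting accuracy per regime rather than pooled exposes exactly the trade-off above: a model can score well on Silent and Stop by defaulting to restraint while failing Act.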

RQ4

Where do current agents fail?

  • Personalized and proactive tasks fail for fundamentally different reasons. Improving one setting does not automatically transfer to the other.
  • On personalized tasks, Clarify errors dominate at 66.7% — agents commit to decisions without asking the right follow-up questions. Partial failures add another 27.1%, showing that even correct preference identification often fails at multi-constraint composition.
  • On proactive tasks, Intervention errors account for 60.0% and Passive errors for 20.0% — agents are more prone to unwarranted action than to missing warranted ones, making initiative over-trigger the dominant failure mode.

Environment And User Modeling

KnowU-Bench couples a reproducible Android environment, a GUI agent, an online user simulator grounded in profiles and logs, and a hybrid evaluation pipeline.
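One way to ground a user simulator in a structured profile is a prompt template over the profile and logs. The template, function names, and toy deterministic "LLM" below are assumptions for illustration, not the benchmark's actual implementation:

```python
# Hypothetical sketch of a profile-grounded user simulator.
SIMULATOR_PROMPT = """You are simulating a specific user.
Profile: {profile}
Recent behavior logs: {logs}
The mobile agent asks: "{question}"
Reply briefly and in character; reveal only what the question asks about."""

def simulate_user(profile, logs, question, llm):
    prompt = SIMULATOR_PROMPT.format(profile=profile, logs=logs, question=question)
    return llm(prompt)

def toy_llm(prompt):
    # Deterministic stand-in for an LLM: answers from the question line only,
    # mimicking a simulator that reveals no more than what was asked.
    question = prompt.split('The mobile agent asks: "')[1].split('"')[0]
    if "milk" in question.lower():
        return "Oat milk, please."
    return "No strong preference."

reply = simulate_user({"milk": "oat"},
                      ["2024-05-01: ordered oat-milk latte"],
                      "Which milk do you prefer?", toy_llm)
```

Keeping the full profile on the simulator's side while the agent sees only the replies is what enables multi-turn clarification and consent handling to be tested online.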

Figure: Overview of the KnowU-Bench environment, mobile agent, and user agent.

Acknowledgements

We thank MobileWorld for providing the original environment and benchmark foundation.