RQ1
How much harder are personalized and proactive tasks?
- General execution is no longer the bottleneck. MAI-UI-8B and Seed 2.0 Pro both reach 100.0% on general tasks, suggesting that following explicit instructions is largely solved for strong models on this benchmark.
- Performance drops sharply once agents must infer preferences or calibrate proactive initiative. Claude Sonnet 4.6 achieves the best overall success rate at 60.4%, while all open-source models remain below 12% on the hard personalized split.
- The average score on personalized tasks is consistently much higher than the strict success rate, indicating that agents can partially align with user preferences yet still fail to compose them into fully correct end-to-end behavior.
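
The gap between average score and strict success rate can be illustrated with a minimal sketch. It assumes, hypothetically, that each task is graded against a list of preference checks, that the per-task score is the fraction of checks satisfied, and that strict success requires every check to pass. The exact scoring rubric is not specified here, and the `tasks` values below are made up for illustration; they are not benchmark results.

```python
# Hypothetical per-task rubric results (illustrative values, not benchmark data).
# Each inner list records whether the agent satisfied each preference check
# for one task.
tasks = [
    [True, True, True],         # all preferences met
    [True, False, True],        # one preference missed
    [True, True, False, True],  # one preference missed
    [False, True],              # one preference missed
]


def average_score(tasks):
    """Mean per-task fraction of satisfied checks (partial credit; assumed definition)."""
    return sum(sum(checks) / len(checks) for checks in tasks) / len(tasks)


def strict_success_rate(tasks):
    """Fraction of tasks in which every check is satisfied (assumed definition)."""
    return sum(1 for checks in tasks if all(checks)) / len(tasks)


avg = average_score(tasks)
strict = strict_success_rate(tasks)
print(f"average score: {avg:.3f}, strict success rate: {strict:.3f}")
```

Under these assumptions, an agent that satisfies most checks on most tasks still earns a low strict success rate, because a single missed preference fails the whole task. This is the same pattern as the reported gap between the two metrics.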