Grounding natural language instructions to interface elements is a fundamental capability for autonomous GUI agents. Despite significant progress through supervised fine-tuning and reinforcement learning, models still fail systematically on small icons and dense layouts in complex interfaces.
A natural remedy is test-time zoom-in scaling: crop a region of the screenshot and re-run the model at higher effective resolution. While this paradigm has shown clear promise for fine-grained GUI localization, a more fundamental question remains unaddressed: which instances actually need zoom-in, and how much should we zoom?
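The crop-and-re-run paradigm above can be sketched as follows. This is an illustrative example, not the paper's implementation: `ground` is a hypothetical stand-in for a GUI grounding model that, given an input size, an instruction, and optionally a region of the screenshot upscaled to that input size, returns a click point in the coordinates of the image it sees.

```python
# Illustrative sketch of test-time zoom-in scaling: crop a window around
# a coarse prediction, re-run the model on the upscaled crop, and map the
# refined point back to full-screenshot coordinates. All names here
# (`ground`, `zoom_in_rerun`) are hypothetical.

def zoom_in_rerun(ground, screen_w, screen_h, instruction, crop_ratio=0.5):
    # Pass 1: coarse prediction on the full screenshot.
    x0, y0 = ground(screen_w, screen_h, instruction)
    # Crop a fixed-ratio window centered on the coarse point, clamped so
    # it stays inside the screenshot.
    cw, ch = int(screen_w * crop_ratio), int(screen_h * crop_ratio)
    left = min(max(int(x0 - cw / 2), 0), screen_w - cw)
    top = min(max(int(y0 - ch / 2), 0), screen_h - ch)
    # Pass 2: the crop is upscaled back to the full input size, so the
    # target occupies more pixels ("higher effective resolution").
    xz, yz = ground(screen_w, screen_h, instruction,
                    region=(left, top, cw, ch))
    # Map the refined point from upscaled-crop coordinates back to the
    # original screenshot.
    return left + xz * cw / screen_w, top + yz * ch / screen_h
```

Note that with a fixed `crop_ratio`, the window cannot adapt to how uncertain the first pass actually was, which is exactly the limitation discussed next.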
Existing zoom-in methods share two key limitations. First, they crop indiscriminately, triggering zoom-in on every instance regardless of whether the model is already confident. Second, they fix the crop window to a predetermined ratio, so the crop is often either too broad to raise effective resolution or too narrow to contain the target.
To address these issues, we propose UI-Zoomer, a training-free adaptive zoom-in framework for GUI grounding. A reliability gate triggers zoom-in only when predictions are uncertain, and an adaptive crop module sizes the window according to prediction variance, yielding stronger robustness at low additional inference cost.
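To make the variance-driven design concrete, here is a minimal sketch of how a reliability gate and an adaptive crop could be wired together. Everything here is an assumption for illustration, not the paper's actual algorithm: `points` are K sampled predictions in normalized [0, 1] coordinates, and `gate_std`, `margin`, `min_frac`, and `max_frac` are hypothetical hyperparameters.

```python
import statistics

def adaptive_zoom_decision(points, gate_std=0.02, margin=3.0,
                           min_frac=0.15, max_frac=0.6):
    """Hypothetical variance-driven zoom decision. `points` holds K
    sampled (x, y) predictions in normalized [0, 1] coordinates.
    Returns (needs_zoom, (cx, cy, half_w, half_h))."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cx, cy = statistics.mean(xs), statistics.mean(ys)
    # Reliability gate: if the samples agree, the prediction is treated
    # as reliable and the zoom-in pass is skipped entirely.
    if max(sx, sy) < gate_std:
        return False, (cx, cy, 0.0, 0.0)
    # Adaptive crop: the window scales with the prediction spread,
    # clamped to a sane range, so more uncertain instances get a
    # proportionally wider crop.
    half_w = min(max(margin * sx, min_frac / 2), max_frac / 2)
    half_h = min(max(margin * sy, min_frac / 2), max_frac / 2)
    return True, (cx, cy, half_w, half_h)
```

The key design point this sketch captures is that both decisions, whether to zoom and how much to zoom, fall out of the same quantity, the spread of the model's own sampled predictions, so no extra training or confidence head is needed.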
Extensive experiments on three GUI grounding benchmarks demonstrate that UI-Zoomer consistently improves over strong baselines, with the largest gains on icon targets and dense interfaces.