UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

Fei Tang1,*, Bofan Chen1,*, Zhengxi Lu1, Tongbo Chen1,
Songqin Nong2, Tao Jiang2, Wenhao Xu2, Weiming Lu1, Jun Xiao1, Yueting Zhuang1, Yongliang Shen1,†
1Zhejiang University 2Ant Group
*Equal contribution †Corresponding author

Abstract

GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.

Introduction

Introduction figure for UI-Zoomer

Grounding natural language instructions to interface elements is a fundamental capability for autonomous GUI agents. Despite significant progress through supervised fine-tuning and reinforcement learning, models still fail systematically on small icons and dense layouts in complex interfaces.

A natural remedy is test-time zoom-in scaling: crop a region of the screenshot and re-run the model at higher effective resolution. While this paradigm has shown clear promise for fine-grained GUI localization, a more fundamental question remains unaddressed: which instances actually need zoom-in, and how much should we zoom?

Existing zoom-in methods share two fundamental limitations. First, they trigger cropping indiscriminately on every instance, regardless of whether the model is actually uncertain about the target. Second, they fix the crop window to a predetermined ratio, so the crop is often either too broad, diluting the resolution gain, or too narrow, risking cutting the target out of view.

To address these issues, we propose UI-Zoomer, a training-free adaptive zoom-in framework for GUI grounding. A reliability gate skips zoom-in when the model is already confident, and an adaptive crop module sizes the zoom window from prediction variance, yielding stronger robustness at low inference overhead.

Extensive experiments on three GUI grounding benchmarks demonstrate that UI-Zoomer consistently improves over strong baselines, with significant gains on icon targets and dense interfaces.

Method

UI-Zoomer is a training-free adaptive zoom-in framework for GUI grounding. Given a GUI screenshot I and a natural-language instruction q, the goal is to predict a click location in normalized image coordinates. Each localization hypothesis is represented as an axis-aligned bounding box [x1, y1, x2, y2], and the final click is defined as the center of the box.

UI-Zoomer proceeds in three stages: (1) global multi-sampling, (2) reliability gating, and (3) uncertainty-driven adaptive crop and zoom. The key idea is simple: zoom only when uncertain, and zoom by how much predictions disagree.

Global Multi-Sampling

UI-Zoomer samples N = 8 candidate boxes from the full screenshot with stochastic decoding at temperature T = 0.9. Each valid prediction is paired with a token-level confidence score derived from the generation probabilities of its output tokens.
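The paper does not spell out how token probabilities are aggregated into a per-prediction confidence; one common choice, sketched below, is the geometric mean of the token probabilities (the exponential of the mean log-probability). The function name `token_confidence` and this aggregation are illustrative assumptions, not the paper's exact formula.

```python
import math

def token_confidence(token_logprobs):
    """Confidence of one decoded prediction, taken here as the
    geometric mean of its token probabilities, i.e. exp of the
    mean log-probability. (An illustrative aggregation choice.)"""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Log-probs of the tokens that spell out one candidate box.
logprobs = [-0.05, -0.20, -0.10, -0.02]
conf = token_confidence(logprobs)  # near 1.0 when decoding is confident
```

Averaging in log space keeps long and short generations comparable, since the score does not shrink merely because a box takes more tokens to emit.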

Reliability Gating

UI-Zoomer decides whether zoom-in is needed by combining spatial consensus across sampled boxes with average token confidence. If the gating score is high, the result is returned directly through consensus voting.

Adaptive Crop and Zoom

For uncertain cases, UI-Zoomer filters outliers and estimates an adaptive crop window using variance decomposition. It then performs one deterministic zoom-in pass and maps the refined prediction back to the original image.

This design avoids unnecessary refinement on easy instances while allocating higher effective resolution to ambiguous ones. Compared with fixed-ratio zoom-in strategies, UI-Zoomer adapts both the trigger and the crop scale on a per-instance basis, making it more robust on dense layouts, small icons, and high-resolution professional interfaces.

Overview of UI-Zoomer

Detailed Pipeline

At a finer level, UI-Zoomer first draws multiple stochastic predictions from the full screenshot. It then computes a reliability score by combining mean pairwise IoU among sampled boxes with average token confidence. Reliable cases are handled directly by voting, while uncertain cases are routed to adaptive refinement.

For uncertain cases, UI-Zoomer keeps the most spatially consistent candidates, estimates the target uncertainty through inter-sample variance and intra-sample variance, and converts this uncertainty into an adaptive square crop. A single deterministic re-inference pass is then performed on the cropped image, and the refined box is mapped back to global coordinates.
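The final step above, mapping the refined box from crop coordinates back to global coordinates, is a simple affine transform. A minimal sketch, assuming the crop's top-left corner `crop_origin` and a uniform zoom factor `scale` (names are illustrative):

```python
def crop_to_global(box_in_crop, crop_origin, scale):
    """Map a box predicted on the zoomed crop back to full-image
    coordinates. crop_origin is the crop's top-left (x0, y0) in the
    original image; scale is the zoom factor applied to the crop."""
    x0, y0 = crop_origin
    x1, y1, x2, y2 = box_in_crop
    return (x0 + x1 / scale, y0 + y1 / scale,
            x0 + x2 / scale, y0 + y2 / scale)

# A box found in a 2x-zoomed crop whose top-left sits at (100, 50):
crop_to_global((10, 10, 20, 20), (100, 50), 2.0)
```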

Spatial Consensus

Cross-sample agreement is measured by the mean pairwise IoU of sampled boxes. High agreement indicates that the model is already confident about the target location.
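The consensus statistic can be sketched directly from its definition: average the IoU over every unordered pair of sampled boxes. Boxes follow the `[x1, y1, x2, y2]` convention given in the Method section; the helper names are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_pairwise_iou(boxes):
    """Spatial consensus: mean IoU over all unordered pairs of samples."""
    n = len(boxes)
    if n < 2:
        return 1.0  # a single sample trivially agrees with itself
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(iou(boxes[i], boxes[j]) for i, j in pairs) / len(pairs)
```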

Confidence-Aware Gating

The gating score combines spatial consensus and average token confidence. This allows UI-Zoomer to distinguish reliable predictions from ambiguous ones before deciding whether to zoom.
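The exact fusion rule and thresholds are not reproduced here; a minimal sketch is a convex combination of the two signals compared against a threshold. The weight `alpha` and threshold `tau` below are illustrative placeholders, not the paper's tuned values.

```python
def gate_score(consensus_iou, mean_token_conf, alpha=0.5):
    """Fuse spatial consensus with token confidence via a convex
    combination. alpha is an illustrative weight, not a tuned value."""
    return alpha * consensus_iou + (1 - alpha) * mean_token_conf

def needs_zoom(consensus_iou, mean_token_conf, tau=0.7):
    """Trigger zoom-in only when fused reliability falls below tau."""
    return gate_score(consensus_iou, mean_token_conf) < tau
```

With this shape, high agreement can compensate for modest token confidence and vice versa, while cases weak on both signals are routed to refinement.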

Var-Decomposed Crop

Crop size is determined from both positional disagreement across samples and the predicted spatial extent of each box. This makes the zoom-in region adaptive, rather than relying on a fixed crop ratio.
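The law-of-total-variance decomposition can be sketched as follows: the inter-sample term is the variance of box centers across samples, and the intra-sample term is approximated by the mean squared half-extent of each box (a proxy for positional variance within a box). The radius multiplier `k` is an illustrative assumption, not the paper's constant.

```python
import math
import statistics

def crop_radius(boxes, k=3.0):
    """Per-instance crop half-width from a variance decomposition.

    Inter-sample term: variance of box centers across samples.
    Intra-sample term: mean squared half-extent of each box, a proxy
    for within-box positional variance. Their sum approximates the
    total variance of the click location (law of total variance);
    the radius is taken as k standard deviations (k is illustrative).
    """
    cx = [(b[0] + b[2]) / 2 for b in boxes]
    cy = [(b[1] + b[3]) / 2 for b in boxes]
    inter = statistics.pvariance(cx) + statistics.pvariance(cy)
    intra = sum(((b[2] - b[0]) / 2) ** 2 + ((b[3] - b[1]) / 2) ** 2
                for b in boxes) / len(boxes)
    return k * math.sqrt(inter + intra)
```

When samples agree, the inter-sample term vanishes and the crop shrinks toward the predicted element size; when they scatter, the crop widens so the true target still falls inside the zoom window.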

Main Results

UI-Zoomer consistently improves strong GUI grounding backbones across three challenging benchmarks: ScreenSpot-Pro, ScreenSpot-v2, and UI-Vision. The method achieves average gains of up to +13.4, +4.2, and +10.3 points on these benchmarks, respectively. These results show that uncertainty-driven adaptive zoom-in is effective across both general-purpose and GUI-specialized vision-language models.

ScreenSpot-Pro

| Model | Text | Icon | Avg | +UI-Zoomer Text | +UI-Zoomer Icon | +UI-Zoomer Avg | Gain (Avg) |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 40.6 | 6.6 | 27.6 | 54.0 | 19.9 | 41.0 | +13.4 |
| GUI-G²-7B | 64.1 | 23.3 | 48.7 | 76.7 | 36.8 | 61.4 | +12.7 |
| UI-Venus-7B | 66.7 | 22.9 | 50.0 | 78.1 | 35.4 | 61.8 | +11.8 |
| UI-Venus-72B | 74.0 | 35.3 | 59.2 | 81.3 | 46.0 | 67.8 | +8.6 |

ScreenSpot-v2

| Model | Text | Icon | Avg | +UI-Zoomer Text | +UI-Zoomer Icon | +UI-Zoomer Avg | Gain (Avg) |
|---|---|---|---|---|---|---|---|
| UI-Venus-7B | 97.4 | 89.7 | 94.0 | 97.8 | 91.2 | 94.9 | +0.9 |
| Qwen2.5-VL-7B | 92.3 | 80.5 | 87.2 | 95.7 | 85.9 | 91.4 | +4.2 |
| GUI-G²-7B | 97.1 | 89.2 | 93.6 | 96.5 | 90.3 | 93.8 | +0.2 |

UI-Vision

| Model | Text | Icon | Avg | +UI-Zoomer Text | +UI-Zoomer Icon | +UI-Zoomer Avg | Gain (Avg) |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 38.2 | 7.9 | 13.3 | 57.0 | 16.3 | 23.6 | +10.3 |
| GUI-G²-7B | 59.5 | 17.1 | 24.7 | 69.3 | 25.0 | 32.9 | +8.2 |
| UI-Venus-7B | 60.6 | 16.5 | 24.4 | 72.7 | 25.2 | 33.7 | +9.3 |
| UI-Venus-72B | 64.3 | 23.6 | 30.9 | 77.6 | 32.3 | 40.4 | +9.5 |

Analysis

BibTeX

To be added after publication.