UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

Fei Tang1,*, Bofan Chen1,*, Zhengxi Lu1, Tongbo Chen1,
Songqin Nong2, Tao Jiang2, Wenhao Xu2, Weiming Lu1, Jun Xiao1, Yueting Zhuang1, Yongliang Shen1,†
1Zhejiang University 2Ant Group
*Equal contribution †Corresponding author

Abstract

GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.

Introduction

Introduction figure for UI-Zoomer

Grounding natural language instructions to interface elements is a fundamental capability for autonomous GUI agents. Despite significant progress through supervised fine-tuning and reinforcement learning, models still fail systematically on small icons and dense layouts in complex interfaces.

A natural remedy is test-time zoom-in scaling: crop a region of the screenshot and re-run the model at higher effective resolution. While this paradigm has shown clear promise for fine-grained GUI localization, a more fundamental question remains unaddressed: which instances actually need zoom-in, and how much should we zoom?

Existing zoom-in methods share two fundamental limitations. First, they trigger cropping indiscriminately on every instance, regardless of whether the model is actually uncertain about the target. Second, they fix the crop window to a predetermined ratio, so the crop is often either too broad, diluting the resolution gain, or too narrow, risking cutting the target out of view.

To address these issues, we propose UI-Zoomer, a training-free adaptive zoom-in framework for GUI grounding. A reliability gate skips zoom-in when the model is already confident, and an adaptive crop module sizes the zoom window from prediction variance, yielding stronger robustness at low inference overhead.

Extensive experiments on three GUI grounding benchmarks demonstrate that UI-Zoomer consistently improves over strong baselines, with significant gains on icon targets and dense interfaces.

Method

UI-Zoomer is a training-free adaptive zoom-in framework for GUI grounding. Given a GUI screenshot I and a natural-language instruction q, the goal is to predict a click location in normalized image coordinates. Each localization hypothesis is represented as an axis-aligned bounding box [x1, y1, x2, y2], and the final click is defined as the center of the box.

UI-Zoomer proceeds in three stages: (1) global multi-sampling, (2) reliability gating, and (3) uncertainty-driven adaptive crop and zoom. The key idea is simple: zoom only when uncertain, and zoom by how much predictions disagree.

Global Multi-Sampling

UI-Zoomer samples N = 8 candidate boxes from the full screenshot with stochastic decoding at temperature T = 0.9. Each valid prediction is paired with a token-level confidence score derived from the generation probabilities of its output tokens.
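The paper does not spell out how token probabilities are aggregated into a per-prediction confidence; one common choice, sketched below, is the geometric mean of the token probabilities (the exponential of the mean log-probability). The function name `token_confidence` and this aggregation are illustrative assumptions, not the paper's exact formula.

```python
import math

def token_confidence(token_logprobs):
    """Confidence of one decoded prediction, taken here as the
    geometric mean of its token probabilities, i.e. exp of the
    mean log-probability. (An illustrative aggregation choice.)"""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Log-probs of the tokens that spell out one candidate box.
logprobs = [-0.05, -0.20, -0.10, -0.02]
conf = token_confidence(logprobs)  # near 1.0 when decoding is confident
```

Averaging in log space keeps long and short generations comparable, since the score does not shrink merely because a box takes more tokens to emit.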

Reliability Gating

UI-Zoomer decides whether zoom-in is needed by combining spatial consensus across sampled boxes with average token confidence. If the gating score is high, the result is returned directly through consensus voting.

Adaptive Crop and Zoom

For uncertain cases, UI-Zoomer filters outliers and estimates an adaptive crop window using variance decomposition. It then performs one deterministic zoom-in pass and maps the refined prediction back to the original image.

This design avoids unnecessary refinement on easy instances while allocating higher effective resolution to ambiguous ones. Compared with fixed-ratio zoom-in strategies, UI-Zoomer adapts both the trigger and the crop scale on a per-instance basis, making it more robust on dense layouts, small icons, and high-resolution professional interfaces.

Overview of UI-Zoomer

Detailed Pipeline

At a finer level, UI-Zoomer first draws multiple stochastic predictions from the full screenshot. It then computes a reliability score by combining mean pairwise IoU among sampled boxes with average token confidence. Reliable cases are handled directly by voting, while uncertain cases are routed to adaptive refinement.

For uncertain cases, UI-Zoomer keeps the most spatially consistent candidates, estimates the target uncertainty through inter-sample variance and intra-sample variance, and converts this uncertainty into an adaptive square crop. A single deterministic re-inference pass is then performed on the cropped image, and the refined box is mapped back to global coordinates.
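The final step above, mapping the refined box from crop coordinates back to global coordinates, is a simple affine transform. A minimal sketch, assuming the crop's top-left corner `crop_origin` and a uniform zoom factor `scale` (names are illustrative):

```python
def crop_to_global(box_in_crop, crop_origin, scale):
    """Map a box predicted on the zoomed crop back to full-image
    coordinates. crop_origin is the crop's top-left (x0, y0) in the
    original image; scale is the zoom factor applied to the crop."""
    x0, y0 = crop_origin
    x1, y1, x2, y2 = box_in_crop
    return (x0 + x1 / scale, y0 + y1 / scale,
            x0 + x2 / scale, y0 + y2 / scale)

# A box found in a 2x-zoomed crop whose top-left sits at (100, 50):
crop_to_global((10, 10, 20, 20), (100, 50), 2.0)
```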

Spatial Consensus

Cross-sample agreement is measured by the mean pairwise IoU of sampled boxes. High agreement indicates that the model is already confident about the target location.
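The consensus statistic can be sketched directly from its definition: average the IoU over every unordered pair of sampled boxes. Boxes follow the `[x1, y1, x2, y2]` convention given in the Method section; the helper names are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_pairwise_iou(boxes):
    """Spatial consensus: mean IoU over all unordered pairs of samples."""
    n = len(boxes)
    if n < 2:
        return 1.0  # a single sample trivially agrees with itself
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(iou(boxes[i], boxes[j]) for i, j in pairs) / len(pairs)
```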

Confidence-Aware Gating

The gating score combines spatial consensus and average token confidence. This allows UI-Zoomer to distinguish reliable predictions from ambiguous ones before deciding whether to zoom.
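The exact fusion rule and thresholds are not reproduced here; a minimal sketch is a convex combination of the two signals compared against a threshold. The weight `alpha` and threshold `tau` below are illustrative placeholders, not the paper's tuned values.

```python
def gate_score(consensus_iou, mean_token_conf, alpha=0.5):
    """Fuse spatial consensus with token confidence via a convex
    combination. alpha is an illustrative weight, not a tuned value."""
    return alpha * consensus_iou + (1 - alpha) * mean_token_conf

def needs_zoom(consensus_iou, mean_token_conf, tau=0.7):
    """Trigger zoom-in only when fused reliability falls below tau."""
    return gate_score(consensus_iou, mean_token_conf) < tau
```

With this shape, high agreement can compensate for modest token confidence and vice versa, while cases weak on both signals are routed to refinement.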

Var-Decomposed Crop

Crop size is determined from both positional disagreement across samples and the predicted spatial extent of each box. This makes the zoom-in region adaptive, rather than relying on a fixed crop ratio.
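The law-of-total-variance decomposition can be sketched as follows: the inter-sample term is the variance of box centers across samples, and the intra-sample term is approximated by the mean squared half-extent of each box (a proxy for positional variance within a box). The radius multiplier `k` is an illustrative assumption, not the paper's constant.

```python
import math
import statistics

def crop_radius(boxes, k=3.0):
    """Per-instance crop half-width from a variance decomposition.

    Inter-sample term: variance of box centers across samples.
    Intra-sample term: mean squared half-extent of each box, a proxy
    for within-box positional variance. Their sum approximates the
    total variance of the click location (law of total variance);
    the radius is taken as k standard deviations (k is illustrative).
    """
    cx = [(b[0] + b[2]) / 2 for b in boxes]
    cy = [(b[1] + b[3]) / 2 for b in boxes]
    inter = statistics.pvariance(cx) + statistics.pvariance(cy)
    intra = sum(((b[2] - b[0]) / 2) ** 2 + ((b[3] - b[1]) / 2) ** 2
                for b in boxes) / len(boxes)
    return k * math.sqrt(inter + intra)
```

When samples agree, the inter-sample term vanishes and the crop shrinks toward the predicted element size; when they scatter, the crop widens so the true target still falls inside the zoom window.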

Main Results

UI-Zoomer consistently improves strong GUI grounding backbones across three challenging benchmarks: ScreenSpot-Pro, ScreenSpot-v2, and UI-Vision. The method achieves average gains of up to +13.4, +4.2, and +10.3 points on these benchmarks, respectively. These results show that uncertainty-driven adaptive zoom-in is effective across both general-purpose and GUI-specialized vision-language models.

ScreenSpot-Pro

| Model | Text | Icon | Avg | +UI-Zoomer Text | +UI-Zoomer Icon | +UI-Zoomer Avg | Gain (Avg) |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 40.6 | 6.6 | 27.6 | 54.0 | 19.9 | 41.0 | +13.4 |
| GUI-G²-7B | 64.1 | 23.3 | 48.7 | 76.7 | 36.8 | 61.4 | +12.7 |
| UI-Venus-7B | 66.7 | 22.9 | 50.0 | 78.1 | 35.4 | 61.8 | +11.8 |
| UI-Venus-72B | 74.0 | 35.3 | 59.2 | 81.3 | 46.0 | 67.8 | +8.6 |

ScreenSpot-v2

| Model | Text | Icon | Avg | +UI-Zoomer Text | +UI-Zoomer Icon | +UI-Zoomer Avg | Gain (Avg) |
|---|---|---|---|---|---|---|---|
| UI-Venus-7B | 97.4 | 89.7 | 94.0 | 97.8 | 91.2 | 94.9 | +0.9 |
| Qwen2.5-VL-7B | 92.3 | 80.5 | 87.2 | 95.7 | 85.9 | 91.4 | +4.2 |
| GUI-G²-7B | 97.1 | 89.2 | 93.6 | 96.5 | 90.3 | 93.8 | +0.2 |

UI-Vision

| Model | Text | Icon | Avg | +UI-Zoomer Text | +UI-Zoomer Icon | +UI-Zoomer Avg | Gain (Avg) |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 38.2 | 7.9 | 13.3 | 57.0 | 16.3 | 23.6 | +10.3 |
| GUI-G²-7B | 59.5 | 17.1 | 24.7 | 69.3 | 25.0 | 32.9 | +8.2 |
| UI-Venus-7B | 60.6 | 16.5 | 24.4 | 72.7 | 25.2 | 33.7 | +9.3 |
| UI-Venus-72B | 64.3 | 23.6 | 30.9 | 77.6 | 32.3 | 40.4 | +9.5 |

Analysis

BibTeX

To be added after publication.