Motivation
Traditional GUI grounding approaches suffer from fundamental limitations in reward design. Existing reinforcement learning methods employ binary rewards that treat GUI elements as hit-or-miss targets, creating sparse feedback signals that fail to capture the continuous nature of spatial interactions. This binary paradigm ignores the nuanced spatial relationships inherent in human-computer interaction, where clicks naturally cluster in Gaussian patterns around target centers. Our analysis of real-world interaction data from the AITW dataset confirms that human clicks follow a Gaussian distribution (μ = 0.111, σ = 0.429), providing strong empirical evidence for modeling GUI elements as continuous distributions rather than discrete points. This insight motivates our GUI-G2 framework, which bridges the gap between human clicking behavior and machine learning rewards through principled Gaussian modeling.

Abstract
Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G2), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G2 incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G2 substantially outperforms state-of-the-art methods, with the most significant improvement of 24.7% on ScreenSpot-Pro.
Methodology
Our GUI-G2 framework fundamentally reconceptualizes GUI grounding by modeling clicking points as smooth probability distributions across the interface plane. Rather than treating elements as discrete hit-or-miss targets, GUI-G2 represents them as continuous Gaussian distributions that provide rich spatial information and dense learning signals.

Key Components
- Gaussian Point Rewards: Model precise localization through exponentially decaying distributions centered on element centroids
- Gaussian Coverage Rewards: Assess spatial alignment by measuring overlap between predicted and target distributions
- Adaptive Variance Mechanism: Automatically calibrates reward distributions based on element dimensions
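To make these mechanisms concrete, here is a minimal Python sketch of how such rewards could be computed. It illustrates the general idea rather than the paper's exact formulation: the specific reward forms, the `alpha` scaling constant in `adaptive_sigma`, and the erf-based overlap computation are all assumptions for illustration.

```python
from math import erf, exp, sqrt

def adaptive_sigma(width, height, alpha=0.25):
    """Adaptive variance: scale sigma with the element's dimensions.
    alpha is a hypothetical scaling constant, not a value from the paper."""
    return alpha * width, alpha * height

def gaussian_point_reward(pred, center, sigma):
    """Point reward: decays exponentially with the normalized distance
    between the predicted click and the element centroid."""
    dx = (pred[0] - center[0]) / sigma[0]
    dy = (pred[1] - center[1]) / sigma[1]
    return exp(-0.5 * (dx ** 2 + dy ** 2))

def coverage_reward(pred, sigma, box):
    """Coverage reward: probability mass of a Gaussian centered at the
    prediction that falls inside the target box, one possible way to
    realize 'overlap' between a predicted distribution and a region."""
    def mass_1d(mu, s, lo, hi):
        # CDF difference of N(mu, s^2) over [lo, hi] via the error function.
        return 0.5 * (erf((hi - mu) / (s * sqrt(2))) - erf((lo - mu) / (s * sqrt(2))))
    x0, y0, x1, y1 = box
    return mass_1d(pred[0], sigma[0], x0, x1) * mass_1d(pred[1], sigma[1], y0, y1)
```

For a 100×40 px button with box (0, 0, 100, 40), a click at (52, 21) yields a point reward close to 1.0, while a click near a corner yields far less; both cases still produce a graded, nonzero signal, unlike a binary hit test.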
Experimental Results
ScreenSpot-Pro Results
Performance comparison of agent models on ScreenSpot-Pro across task categories, reporting Text, Icon, and Average accuracy (%). "-" indicates results not reported in the original papers.
Model | CAD Text | CAD Icon | Dev Text | Dev Icon | Creative Text | Creative Icon | Scientific Text | Scientific Icon | Office Text | Office Icon | OS Text | OS Icon | Avg. Text | Avg. Icon | Avg.
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
**Proprietary Models** | | | | | | | | | | | | | | | |
GPT-4o | 2.0 | 0.0 | 1.3 | 0.0 | 1.0 | 0.0 | 2.1 | 0.0 | 1.1 | 0.0 | 0.0 | 0.0 | 1.3 | 0.0 | 0.8 |
Claude Computer Use | 14.5 | 3.7 | 22.0 | 3.9 | 25.9 | 3.4 | 33.9 | 15.8 | 30.1 | 16.3 | 11.0 | 4.5 | 23.4 | 7.1 | 17.1 |
**General Open-source Models** | | | | | | | | | | | | | | | |
Qwen2.5-VL-3B | 9.1 | 7.3 | 22.1 | 1.4 | 26.8 | 2.1 | 38.2 | 7.3 | 33.9 | 15.1 | 10.3 | 1.1 | 23.6 | 3.8 | 16.1 |
Qwen2.5-VL-7B | 16.8 | 1.6 | 46.8 | 4.1 | 35.9 | 7.7 | 49.3 | 7.3 | 52.5 | 20.8 | 37.4 | 6.7 | 38.9 | 7.1 | 26.8 |
**GUI-specific Models (SFT)** | | | | | | | | | | | | | | | |
SeeClick-9.6B | 2.5 | 0.0 | 0.6 | 0.0 | 1.0 | 0.0 | 3.5 | 0.0 | 1.1 | 0.0 | 2.8 | 0.0 | 1.8 | 0.0 | 1.1 |
Focus-2B | 7.6 | 3.1 | 22.8 | 1.7 | 23.7 | 1.7 | 25.0 | 7.1 | 23.2 | 7.7 | 17.8 | 2.5 | 19.8 | 3.9 | 13.3 |
CogAgent-18B | 7.1 | 3.1 | 14.9 | 0.7 | 9.6 | 0.0 | 22.2 | 1.8 | 13.0 | 0.0 | 5.6 | 0.0 | 12.0 | 0.8 | 7.7 |
Aria-UI | 7.6 | 1.6 | 16.2 | 0.0 | 23.7 | 2.1 | 27.1 | 6.4 | 20.3 | 1.9 | 4.7 | 0.0 | 17.1 | 2.0 | 11.3 |
OS-Atlas-7B | 12.2 | 4.7 | 33.1 | 1.4 | 28.8 | 2.8 | 37.5 | 7.3 | 33.9 | 5.7 | 27.1 | 4.5 | 28.1 | 4.0 | 18.9 |
ShowUI-2B | 2.5 | 0.0 | 16.9 | 1.4 | 9.1 | 0.0 | 13.2 | 7.3 | 15.3 | 7.5 | 10.3 | 2.2 | 10.8 | 2.6 | 7.7 |
UGround-7B | 14.2 | 1.6 | 26.6 | 2.1 | 27.3 | 2.8 | 31.9 | 2.7 | 31.6 | 11.3 | 17.8 | 0.0 | 25.0 | 2.8 | 16.5 |
UGround-V1-7B | 15.8 | 1.2 | 51.9 | 2.8 | 47.5 | 9.7 | 57.6 | 14.5 | 60.5 | 13.2 | 38.3 | 7.9 | 45.2 | 8.1 | 31.1 |
UI-TARS-2B | 17.8 | 4.7 | 47.4 | 4.1 | 42.9 | 6.3 | 56.9 | 17.3 | 50.3 | 17.0 | 21.5 | 5.6 | 39.6 | 8.4 | 27.7 |
UI-TARS-7B | 20.8 | 9.4 | 58.4 | 12.4 | 50.0 | 9.1 | 63.9 | 31.8 | 63.3 | 20.8 | 30.8 | 16.9 | 47.8 | 16.2 | 35.7 |
UI-TARS-72B | 18.8 | 12.5 | 62.9 | 17.2 | 57.1 | 15.4 | 64.6 | 20.9 | 63.3 | 26.4 | 42.1 | 15.7 | 50.9 | 17.6 | 38.1 |
Jedi-3B | 27.4 | 9.4 | 61.0 | 13.8 | 53.5 | 8.4 | 54.2 | 18.2 | 64.4 | 32.1 | 38.3 | 9.0 | 49.8 | 13.7 | 36.1 |
Jedi-7B | 38.0 | 14.1 | 42.9 | 11.0 | 50.0 | 11.9 | 72.9 | 25.5 | 75.1 | 47.2 | 33.6 | 16.9 | 52.6 | 18.2 | 39.5 |
GUI-Actor-7B | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 44.6 |
**GUI-specific Models (RL)** | | | | | | | | | | | | | | | |
UI-R1-3B | 11.2 | 6.3 | 22.7 | 4.1 | 27.3 | 3.5 | 42.4 | 11.8 | 32.2 | 11.3 | 13.1 | 4.5 | 24.9 | 6.4 | 17.8 |
UI-R1-E-3B | 37.1 | 12.5 | 46.1 | 6.9 | 41.9 | 4.2 | 56.9 | 21.8 | 65.0 | 26.4 | 32.7 | 10.1 | - | - | 33.5 |
GUI-R1-3B | 26.4 | 7.8 | 33.8 | 4.8 | 40.9 | 5.6 | 61.8 | 17.3 | 53.6 | 17.0 | 28.1 | 5.6 | - | - | - |
GUI-R1-7B | 23.9 | 6.3 | 49.4 | 4.8 | 38.9 | 8.4 | 55.6 | 11.8 | 58.7 | 26.4 | 42.1 | 16.9 | - | - | - |
InfiGUI-R1-3B | 33.0 | 14.1 | 51.3 | 12.4 | 44.9 | 7.0 | 58.3 | 20.0 | 65.5 | 28.3 | 43.9 | 12.4 | 49.1 | 14.1 | 35.7 |
GUI-G1-3B | 39.6 | 9.4 | 50.7 | 10.3 | 36.6 | 11.9 | 61.8 | 30.0 | 67.2 | 32.1 | 23.5 | 10.6 | 49.5 | 16.8 | 37.1 |
SE-GUI-3B | 38.1 | 12.5 | 55.8 | 7.6 | 47.0 | 4.9 | 61.8 | 16.4 | 59.9 | 24.5 | 40.2 | 12.4 | 50.4 | 11.8 | 35.9 |
SE-GUI-7B | 51.3 | 42.2 | 68.2 | 19.3 | 57.6 | 9.1 | 75.0 | 28.2 | 78.5 | 43.4 | 49.5 | 25.8 | 63.5 | 21.0 | 47.3 |
**Ours** | | | | | | | | | | | | | | | |
GUI-G2-7B | 55.8 | 12.5 | 68.8 | 17.2 | 57.1 | 15.4 | 77.1 | 24.5 | 74.0 | 32.7 | 57.9 | 21.3 | 64.7 | 19.6 | 47.5 |
ScreenSpot Results
Performance comparison on the ScreenSpot benchmark; all values are accuracy (%). Bold highlights the best results. "-" indicates values unavailable because the original paper does not report them or the model checkpoint and inference code are unreleased.
Model | Mobile Text | Mobile Icon | Desktop Text | Desktop Icon | Web Text | Web Icon | Avg.
---|---|---|---|---|---|---|---
**Proprietary Models** | | | | | | | |
GPT-4o | 30.5 | 23.2 | 20.6 | 19.4 | 11.1 | 7.8 | 18.8 |
Claude Computer Use | - | - | - | - | - | - | 83.0 |
**General Open-source Models** | | | | | | | |
Qwen2-VL-7B | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.9 |
Qwen2.5-VL-3B | - | - | - | - | - | - | 55.5 |
Qwen2.5-VL-7B | - | - | - | - | - | - | 84.7 |
**GUI-specific Models (SFT)** | | | | | | | |
CogAgent-18B | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
SeeClick-9.6B | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
UGround-7B | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
OS-Atlas-7B | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 82.5 |
ShowUI-2B | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
Focus-2B | 90.1 | 78.2 | 80.9 | 65.0 | 81.7 | 68.5 | 77.4 |
Aguvis-7B | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 84.4 |
Aguvis-72B | 94.5 | 85.2 | 95.4 | 77.9 | 91.3 | 85.9 | 89.2 |
UI-TARS-2B | 93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 | 82.3 |
UI-TARS-7B | 94.5 | 85.2 | 95.9 | 85.7 | 90.0 | 83.5 | 89.5 |
UI-TARS-72B | 94.9 | 82.5 | 89.7 | **88.6** | 88.7 | 85.0 | 88.4 |
GUI-Actor-7B | 94.9 | 82.1 | 91.8 | 80.0 | 91.3 | 85.4 | 88.3 |
**GUI-specific Models (RL)** | | | | | | | |
UI-R1-3B | 95.6 | 84.7 | 90.2 | 59.3 | 85.2 | 73.3 | 83.3 |
UI-R1-E-3B | 97.1 | 83.0 | 95.4 | 77.9 | **91.7** | 85.0 | 89.2 |
GUI-R1-3B | - | - | 93.8 | 64.8 | 89.6 | 72.1 | - |
GUI-R1-7B | - | - | 91.8 | 73.6 | 91.3 | 75.7 | - |
InfiGUI-R1-3B | 97.1 | 81.2 | 94.3 | 77.1 | **91.7** | 77.6 | 87.5 |
SE-GUI-7B | - | - | - | - | - | - | 88.2 |
GUI-G1-3B | **98.6** | 85.8 | **96.4** | 80.7 | 91.4 | 82.3 | 90.3 |
**Ours** | | | | | | | |
GUI-G2-7B | 96.7 | **90.8** | 95.9 | **88.6** | 90.9 | **86.9** | **92.0** |
Analysis


Key Insights
Continuous vs Binary Rewards
Human clicking behavior naturally follows Gaussian distributions. Our continuous reward mechanism captures this spatial uncertainty, providing dense learning signals where binary rewards offer only sparse feedback.
Adaptive Variance Mechanism
GUI elements vary significantly in size. Our adaptive variance mechanism automatically calibrates reward distributions based on element dimensions, ensuring consistent learning across interface components.
Dual Component Design
Gaussian Point Rewards focus on precise localization while Gaussian Coverage Rewards assess spatial overlap, together modeling both precision and coverage requirements.
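Reusing the helper functions from the sketch above, one hedged way to combine the two components into a single scalar reward is a weighted sum; the equal weights below are illustrative assumptions, not the paper's actual composition:

```python
def gui_g2_reward(pred, box, w_point=0.5, w_cov=0.5):
    """Hypothetical combination of point and coverage rewards.
    Weights are illustrative; the paper defines the actual composition."""
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    sigma = adaptive_sigma(box[2] - box[0], box[3] - box[1])
    r_point = gaussian_point_reward(pred, (cx, cy), sigma)
    r_cov = coverage_reward(pred, sigma, box)
    return w_point * r_point + w_cov * r_cov
```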
Superior Performance
GUI-G2-7B achieves up to a 24.7% improvement on ScreenSpot-Pro and outperforms models with 10× more parameters, demonstrating the effectiveness of our approach.
Citation
@misc{tang2025guig2gaussianrewardmodeling,
title={GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding},
author={Fei Tang and Zhangxuan Gu and Zhengxi Lu and Xuyang Liu and Shuheng Shen and Changhua Meng and Wen Wang and Wenqi Zhang and Yongliang Shen and Weiming Lu and Jun Xiao and Yueting Zhuang},
year={2025},
eprint={2507.15846},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2507.15846},
}
Acknowledgements
We thank the research community for their valuable feedback and the anonymous reviewers for their constructive comments. We also acknowledge the template from Ximing Xing that helped us build this project homepage.