GUI-G²: Gaussian Reward Modeling for GUI Grounding

¹Zhejiang University    ²Ant Group

Under Review



Motivation

Traditional GUI grounding approaches suffer from fundamental limitations in reward design. Existing reinforcement learning methods employ binary rewards that treat GUI elements as simple hit-or-miss targets, creating sparse feedback signals that fail to capture the continuous nature of spatial interactions. This binary paradigm ignores the nuanced spatial relationships inherent in human-computer interaction, where users naturally exhibit clicking patterns that form Gaussian distributions around target centers. Our analysis of real-world interaction data from the AITW dataset reveals that human clicks follow a natural Gaussian pattern (μ = 0.111, σ = 0.429), providing strong empirical evidence for modeling GUI elements as continuous distributions rather than discrete points. This insight motivates our GUI-G² framework, which bridges the gap between human clicking behavior and machine learning rewards through principled Gaussian modeling.

[Figure: GUI-G² performance comparison and human click behavior]
GUI grounding performance and human click behavior. Left: Performance comparison of various models on ScreenSpot-Pro across different parameter scales. Right: Human click distribution from AITW reveals natural Gaussian patterns around target centers (μ = 0.111, σ = 0.429), validating our design choice of continuous Gaussian rewards over discrete binary feedback.

Abstract

Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G²), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G² incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G² substantially outperforms state-of-the-art methods, with the most significant improvement of 24.7% on ScreenSpot-Pro.


Methodology


Our GUI-G² framework fundamentally reconceptualizes GUI grounding by modeling clicking points as smooth probability distributions across the interface plane. Rather than treating elements as discrete hit-or-miss targets, GUI-G² represents them as continuous Gaussian distributions that provide rich spatial information and dense learning signals.

[Figure: GUI-G² method overview]

Key Components

  • Gaussian Point Rewards: Model precise localization through exponentially decaying distributions centered on element centroids
  • Gaussian Coverage Rewards: Assess spatial alignment by measuring overlap between predicted and target distributions
  • Adaptive Variance Mechanism: Automatically calibrates reward distributions based on element dimensions
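
The three components can be written down compactly. The block below is one plausible formalization under the assumption of axis-aligned Gaussians. The notation, the scale factor α, and the choice of the Bhattacharyya coefficient as the overlap measure are illustrative assumptions, not definitions given on this page.

```latex
% Assumed notation (not from this page): target box (x_1, y_1, x_2, y_2) with
% center \mu^{gt}; predicted box with center \mu^{p}; \alpha is a hypothetical scale factor.

% Adaptive variance: spread proportional to element size (same rule for the predicted box).
\sigma_x^{gt} = \alpha\,(x_2 - x_1), \qquad \sigma_y^{gt} = \alpha\,(y_2 - y_1)

% Gaussian point reward: 1 at the target center, decaying exponentially with distance.
R_{\text{point}} = \exp\!\left(-\frac{1}{2}\left[
  \frac{(\mu_x^{p}-\mu_x^{gt})^2}{(\sigma_x^{gt})^2} +
  \frac{(\mu_y^{p}-\mu_y^{gt})^2}{(\sigma_y^{gt})^2}\right]\right)

% Gaussian coverage reward: overlap of the two distributions (Bhattacharyya coefficient),
% with \Sigma^{p} = \mathrm{diag}((\sigma_x^{p})^2, (\sigma_y^{p})^2) and likewise for \Sigma^{gt}.
R_{\text{coverage}} = \exp\!\left(-\frac{1}{8}\,\Delta^{\top}\Sigma^{-1}\Delta
  - \frac{1}{2}\ln\frac{\det\Sigma}{\sqrt{\det\Sigma^{p}\,\det\Sigma^{gt}}}\right),
\quad \Delta = \mu^{p}-\mu^{gt},\ \ \Sigma = \tfrac{1}{2}\left(\Sigma^{p}+\Sigma^{gt}\right)
```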

Experimental Results


ScreenSpot-Pro Results

Performance comparison of agent models on ScreenSpot-Pro across task categories, reported as Text, Icon, and Average scores. "-" indicates results not reported in the original papers.

Model CAD-Text CAD-Icon Dev-Text Dev-Icon Creative-Text Creative-Icon Scientific-Text Scientific-Icon Office-Text Office-Icon OS-Text OS-Icon Avg-Text Avg-Icon Avg.
Proprietary Models
GPT-4o 2.0 0.0 1.3 0.0 1.0 0.0 2.1 0.0 1.1 0.0 0.0 0.0 1.3 0.0 0.8
Claude Computer Use 14.5 3.7 22.0 3.9 25.9 3.4 33.9 15.8 30.1 16.3 11.0 4.5 23.4 7.1 17.1
General Open-source Models
Qwen2.5-VL-3B 9.1 7.3 22.1 1.4 26.8 2.1 38.2 7.3 33.9 15.1 10.3 1.1 23.6 3.8 16.1
Qwen2.5-VL-7B 16.8 1.6 46.8 4.1 35.9 7.7 49.3 7.3 52.5 20.8 37.4 6.7 38.9 7.1 26.8
GUI-specific Models (SFT)
SeeClick-9.6B 2.5 0.0 0.6 0.0 1.0 0.0 3.5 0.0 1.1 0.0 2.8 0.0 1.8 0.0 1.1
Focus-2B 7.6 3.1 22.8 1.7 23.7 1.7 25.0 7.1 23.2 7.7 17.8 2.5 19.8 3.9 13.3
CogAgent-18B 7.1 3.1 14.9 0.7 9.6 0.0 22.2 1.8 13.0 0.0 5.6 0.0 12.0 0.8 7.7
Aria-UI 7.6 1.6 16.2 0.0 23.7 2.1 27.1 6.4 20.3 1.9 4.7 0.0 17.1 2.0 11.3
OS-Atlas-7B 12.2 4.7 33.1 1.4 28.8 2.8 37.5 7.3 33.9 5.7 27.1 4.5 28.1 4.0 18.9
ShowUI-2B 2.5 0.0 16.9 1.4 9.1 0.0 13.2 7.3 15.3 7.5 10.3 2.2 10.8 2.6 7.7
UGround-7B 14.2 1.6 26.6 2.1 27.3 2.8 31.9 2.7 31.6 11.3 17.8 0.0 25.0 2.8 16.5
UGround-V1-7B 15.8 1.2 51.9 2.8 47.5 9.7 57.6 14.5 60.5 13.2 38.3 7.9 45.2 8.1 31.1
UI-TARS-2B 17.8 4.7 47.4 4.1 42.9 6.3 56.9 17.3 50.3 17.0 21.5 5.6 39.6 8.4 27.7
UI-TARS-7B 20.8 9.4 58.4 12.4 50.0 9.1 63.9 31.8 63.3 20.8 30.8 16.9 47.8 16.2 35.7
UI-TARS-72B 18.8 12.5 62.9 17.2 57.1 15.4 64.6 20.9 63.3 26.4 42.1 15.7 50.9 17.6 38.1
Jedi-3B 27.4 9.4 61.0 13.8 53.5 8.4 54.2 18.2 64.4 32.1 38.3 9.0 49.8 13.7 36.1
Jedi-7B 38.0 14.1 42.9 11.0 50.0 11.9 72.9 25.5 75.1 47.2 33.6 16.9 52.6 18.2 39.5
GUI-Actor-7B - - - - - - - - - - - - - - 44.6
GUI-specific Models (RL)
UI-R1-3B 11.2 6.3 22.7 4.1 27.3 3.5 42.4 11.8 32.2 11.3 13.1 4.5 24.9 6.4 17.8
UI-R1-E-3B 37.1 12.5 46.1 6.9 41.9 4.2 56.9 21.8 65.0 26.4 32.7 10.1 - - 33.5
GUI-R1-3B 26.4 7.8 33.8 4.8 40.9 5.6 61.8 17.3 53.6 17.0 28.1 5.6 - - -
GUI-R1-7B 23.9 6.3 49.4 4.8 38.9 8.4 55.6 11.8 58.7 26.4 42.1 16.9 - - -
InfiGUI-R1-3B 33.0 14.1 51.3 12.4 44.9 7.0 58.3 20.0 65.5 28.3 43.9 12.4 49.1 14.1 35.7
GUI-G1-3B 39.6 9.4 50.7 10.3 36.6 11.9 61.8 30.0 67.2 32.1 23.5 10.6 49.5 16.8 37.1
SE-GUI-3B 38.1 12.5 55.8 7.6 47.0 4.9 61.8 16.4 59.9 24.5 40.2 12.4 50.4 11.8 35.9
SE-GUI-7B 51.3 42.2 68.2 19.3 57.6 9.1 75.0 28.2 78.5 43.4 49.5 25.8 63.5 21.0 47.3
Ours
GUI-G2-7B 55.8 12.5 68.8 17.2 57.1 15.4 77.1 24.5 74.0 32.7 57.9 21.3 64.7 19.6 47.5

ScreenSpot Results

Performance comparison on the ScreenSpot benchmark. Bold highlights the best results; "-" indicates values that are missing because results were not reported in the original paper or because model checkpoints and inference code are unreleased.

Model Mobile-Text Mobile-Icon Desktop-Text Desktop-Icon Web-Text Web-Icon Avg. (all values: accuracy, %)
Proprietary Models
GPT-4o 30.5 23.2 20.6 19.4 11.1 7.8 18.8
Claude Computer Use - - - - - - 83.0
General Open-source Models
Qwen2-VL-7B 61.3 39.3 52.0 45.0 33.0 21.8 42.9
Qwen2.5-VL-3B - - - - - - 55.5
Qwen2.5-VL-7B - - - - - - 84.7
GUI-specific Models (SFT)
CogAgent-18B 67.0 24.0 74.2 20.0 70.4 28.6 47.4
SeeClick-9.6B 78.0 52.0 72.2 30.0 55.7 32.5 53.4
UGround-7B 82.8 60.3 82.5 63.6 80.4 70.4 73.3
OS-Atlas-7B 93.0 72.9 91.8 62.9 90.9 74.3 82.5
ShowUI-2B 92.3 75.5 76.3 61.1 81.7 63.6 75.1
Focus-2B 90.1 78.2 80.9 65.0 81.7 68.5 77.4
Aguvis-7B 95.6 77.7 93.8 67.1 88.3 75.2 84.4
Aguvis-72B 94.5 85.2 95.4 77.9 91.3 85.9 89.2
UI-TARS-2B 93.0 75.5 90.7 68.6 84.3 74.8 82.3
UI-TARS-7B 94.5 85.2 95.9 85.7 90.0 83.5 89.5
UI-TARS-72B 94.9 82.5 89.7 88.6 88.7 85.0 88.4
GUI-Actor-7B 94.9 82.1 91.8 80.0 91.3 85.4 88.3
GUI-specific Models (RL)
UI-R1-3B 95.6 84.7 90.2 59.3 85.2 73.3 83.3
UI-R1-E-3B 97.1 83.0 95.4 77.9 91.7 85.0 89.2
GUI-R1-3B - - 93.8 64.8 89.6 72.1 -
GUI-R1-7B - - 91.8 73.6 91.3 75.7 -
InfiGUI-R1-3B 97.1 81.2 94.3 77.1 91.7 77.6 87.5
SE-GUI-7B - - - - - - 88.2
GUI-G1-3B 98.6 85.8 96.4 80.7 91.4 82.3 90.3
Ours
GUI-G2-7B 96.7 90.8 95.9 88.6 90.9 86.9 92.0

Analysis

[Figure: Reward Analysis]
[Figure: Ablation Study]

Key Insights


Continuous vs Binary Rewards

Human clicking behavior naturally follows Gaussian distributions. Our continuous reward mechanism captures this spatial uncertainty, providing dense learning signals compared to sparse binary rewards.
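
To make the sparse-versus-dense contrast concrete, here is a minimal sketch that scores the same predicted points with a binary hit test and with a Gaussian point reward. The function names, the example box, and the scale factor `alpha` are hypothetical choices for illustration, not values taken from this page.

```python
import math

def binary_reward(px, py, box):
    """Return 1.0 if the predicted point lies inside the target box, else 0.0."""
    x1, y1, x2, y2 = box
    return 1.0 if (x1 <= px <= x2 and y1 <= py <= y2) else 0.0

def gaussian_point_reward(px, py, box, alpha=0.5):
    """Smooth reward that decays with distance from the box center.

    alpha is a hypothetical scale factor mapping box size to sigma.
    Assumes the box has positive width and height.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    sx, sy = alpha * (x2 - x1), alpha * (y2 - y1)
    z = ((px - cx) / sx) ** 2 + ((py - cy) / sy) ** 2
    return math.exp(-0.5 * z)

# Hypothetical target element, (x1, y1, x2, y2) in pixels.
target = (100.0, 100.0, 200.0, 140.0)

# Sweep the prediction horizontally, from the center to well outside the box.
for px in (150.0, 190.0, 210.0, 260.0):
    b = binary_reward(px, 120.0, target)
    g = gaussian_point_reward(px, 120.0, target)
    print(f"x={px:5.0f}  binary={b:.0f}  gaussian={g:.3f}")
```

Under the binary reward, a prediction 10 px outside the box is indistinguishable from one 60 px outside, whereas the Gaussian reward still ranks the nearer miss higher, which is the kind of graded signal the paragraph above describes.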

Adaptive Variance Mechanism

GUI elements vary significantly in size. Our adaptive variance mechanism automatically calibrates reward distributions based on element dimensions, ensuring consistent learning across interface components.
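
The effect of size-dependent variance can be sketched as follows. The rule `sigma = alpha * side_length`, the value of `alpha`, and the fixed sigma of 30 px used for comparison are assumptions made for this example only.

```python
import math

def adaptive_sigma(box, alpha=0.5):
    """Derive per-axis sigma from the box size (alpha is a hypothetical scale factor)."""
    x1, y1, x2, y2 = box
    return alpha * (x2 - x1), alpha * (y2 - y1)

def point_reward(dx, dy, sigma_x, sigma_y):
    """Gaussian reward for a prediction offset (dx, dy) from the target center."""
    return math.exp(-0.5 * ((dx / sigma_x) ** 2 + (dy / sigma_y) ** 2))

# Two hypothetical elements of very different sizes, (x1, y1, x2, y2) in pixels.
small_icon = (0.0, 0.0, 20.0, 20.0)
large_button = (0.0, 0.0, 300.0, 60.0)
offset = (10.0, 0.0)  # the same 10 px horizontal miss for both elements

results = {}
for name, box in (("small_icon", small_icon), ("large_button", large_button)):
    sx, sy = adaptive_sigma(box)
    results[name] = {
        "adaptive": point_reward(*offset, sx, sy),
        "fixed": point_reward(*offset, 30.0, 30.0),  # one sigma for every element
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

With a single fixed sigma, a 10 px error is rewarded identically regardless of the element; with size-dependent sigma, the same error is penalized more on the small icon, where it reaches the element's edge, than on the large button, where it is negligible.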

Dual Component Design

Gaussian Point Rewards focus on precise localization while Gaussian Coverage Rewards assess spatial overlap, together modeling both precision and coverage requirements.
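
A rough sketch of how the two components might be computed and combined is given below. It assumes both boxes are converted to axis-aligned Gaussians, uses the Bhattacharyya coefficient as the overlap measure, and sums the two terms with weights `w_point` and `w_cov`. All of these choices, along with the function names and `alpha`, are assumptions for illustration rather than details stated on this page.

```python
import math

def box_to_gaussian(box, alpha=0.5):
    """Map a box (x1, y1, x2, y2) to a Gaussian: mean at the center, sigma from size.

    alpha is a hypothetical scale factor; the box must have positive width and height.
    """
    x1, y1, x2, y2 = box
    mean = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
    sigma = (alpha * (x2 - x1), alpha * (y2 - y1))
    return mean, sigma

def point_reward(pred_box, gt_box, alpha=0.5):
    """Precision term: how close the predicted center is to the target center."""
    (px, py), _ = box_to_gaussian(pred_box, alpha)
    (gx, gy), (sx, sy) = box_to_gaussian(gt_box, alpha)
    return math.exp(-0.5 * (((px - gx) / sx) ** 2 + ((py - gy) / sy) ** 2))

def coverage_reward(pred_box, gt_box, alpha=0.5):
    """Overlap term: Bhattacharyya coefficient of two axis-aligned Gaussians."""
    (p_mean, p_sigma) = box_to_gaussian(pred_box, alpha)
    (g_mean, g_sigma) = box_to_gaussian(gt_box, alpha)
    bc = 1.0
    # Independent axes, so the 2D coefficient is the product of the 1D coefficients.
    for mu_p, mu_g, s_p, s_g in zip(p_mean, g_mean, p_sigma, g_sigma):
        var_sum = s_p ** 2 + s_g ** 2
        bc *= math.sqrt(2.0 * s_p * s_g / var_sum)
        bc *= math.exp(-((mu_p - mu_g) ** 2) / (4.0 * var_sum))
    return bc

def combined_reward(pred_box, gt_box, w_point=1.0, w_cov=1.0):
    """Weighted sum of the two terms (the weights are hypothetical)."""
    return w_point * point_reward(pred_box, gt_box) + w_cov * coverage_reward(pred_box, gt_box)

gt = (100.0, 100.0, 200.0, 140.0)
oversized = (50.0, 80.0, 250.0, 160.0)   # same center as gt, twice the size
shifted = (130.0, 100.0, 230.0, 140.0)   # same size as gt, moved 30 px right

for name, pred in (("exact", gt), ("oversized", oversized), ("shifted", shifted)):
    print(name, round(point_reward(pred, gt), 3), round(coverage_reward(pred, gt), 3))
```

The oversized prediction illustrates why two terms are useful: its center is perfect, so the point term alone cannot penalize it, but the coverage term drops because the predicted distribution is much wider than the target.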

Superior Performance

Achieves up to 24.7% improvement on ScreenSpot-Pro and outperforms models with roughly 10x more parameters, such as UI-TARS-72B (47.5 vs. 38.1 average on ScreenSpot-Pro for GUI-G2-7B).
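
As a quick arithmetic check, the 24.7% figure matches the relative gain of GUI-G2-7B over UI-TARS-72B when computed from the ScreenSpot-Pro averages in the table above. Reading the headline number as a relative gain over that baseline is an inference from the table, not something stated explicitly on this page.

```python
# ScreenSpot-Pro overall averages taken from the results table above.
gui_g2_7b = 47.5
ui_tars_72b = 38.1

absolute_gain = gui_g2_7b - ui_tars_72b              # difference in percentage points
relative_gain = 100.0 * absolute_gain / ui_tars_72b  # gain relative to the baseline, in %

print(f"absolute: {absolute_gain:.1f} points, relative: {relative_gain:.1f}%")
```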


Citation

@misc{tang2025guig2gaussianrewardmodeling,
  title={GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding},
  author={Fei Tang and Zhangxuan Gu and Zhengxi Lu and Xuyang Liu and Shuheng Shen and Changhua Meng and Wen Wang and Wenqi Zhang and Yongliang Shen and Weiming Lu and Jun Xiao and Yueting Zhuang},
  year={2025},
  eprint={2507.15846},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.15846},
}


Acknowledgements

We thank the research community for their valuable feedback and the anonymous reviewers for their constructive comments. We also acknowledge the template from Ximing Xing that helped us build this project homepage.