
GUI-RCPO: Test-Time Reinforcement Learning for GUI Grounding via Region Consistency

Yong Du1,2,*, Yuchen Yan1,*, Fei Tang1, Zhengxi Lu1, Chang Zong3,
Weiming Lu1, Shengpei Jiang4, Yongliang Shen1,†
1Zhejiang University   2Central South University   3Zhejiang University of Science and Technology   4SF Technology

*Equal Contribution, †Corresponding Author
GUI-RCPO Framework Overview

Overview of our test-time scaling methods for GUI grounding. GUI-RC aggregates K sampled predictions through spatial voting to extract a consensus region, achieving more accurate localization than greedy decoding. GUI-RCPO computes region consistency rewards based on the voting heatmap and uses these self-supervised signals to update model parameters, enabling label-free improvement through test-time reinforcement learning.

Abstract

Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.

Methods

GUI-RC enhances GUI grounding by turning prediction uncertainty into reliable consensus through spatial aggregation. It first generates multiple predictions for the same input via temperature-based sampling, which naturally introduces diversity because the model outputs coordinates in a continuous space. These predictions are then mapped onto a spatial voting grid, where each predicted region casts a vote on every grid cell it covers. The grid thus records how often each location is selected across samples, and the final consensus region is extracted as the largest contiguous area with the highest vote count, representing the most consistent and confident localization.
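
To make the voting and extraction step concrete, here is a minimal sketch, assuming bbox-style predictions given as (x1, y1, x2, y2) pixel coordinates, a voting grid at pixel resolution, and SciPy for the connected-component step; it illustrates the idea rather than reproducing the released implementation.

import numpy as np
from scipy.ndimage import label

def consensus_region(pred_boxes, width, height):
    """Accumulate votes from K sampled boxes and return the consensus region."""
    grid = np.zeros((height, width), dtype=np.int32)
    for x1, y1, x2, y2 in pred_boxes:
        # Clip each predicted box to the screen and cast one vote per covered cell.
        x1, x2 = max(0, int(x1)), min(width, int(x2))
        y1, y2 = max(0, int(y1)), min(height, int(y2))
        grid[y1:y2, x1:x2] += 1

    # Keep only the cells that received the maximum number of votes.
    peak_mask = grid == grid.max()

    # Among the peak cells, take the largest contiguous area (4-connectivity)
    # and return its bounding box as the consensus region.
    labels, n = label(peak_mask)
    sizes = np.bincount(labels.ravel())[1:]           # component sizes, background excluded
    ys, xs = np.where(labels == (np.argmax(sizes) + 1))
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1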

GUI-RCPO builds on GUI-RC by using region consistency as a self-supervised reward signal for test-time reinforcement learning. For each sampled prediction, it computes a reward proportional to the average vote density within the predicted region, encouraging alignment with high-consistency areas. The model is then optimized using Group Relative Policy Optimization (GRPO). Over successive updates, the model increasingly focuses on consistent, high-confidence regions, enabling progressive self-improvement without relying on labeled data.
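
The reward can be sketched in a few lines on top of the voting grid above; normalizing by the peak vote count so that rewards fall in [0, 1] is our assumption here, since the text only states that the reward is proportional to the average vote density inside the predicted region.

import numpy as np

def region_consistency_rewards(pred_boxes, grid):
    """Score each sampled box by the mean normalized vote count it covers."""
    peak = grid.max()
    rewards = []
    for x1, y1, x2, y2 in pred_boxes:
        x1, x2 = max(0, int(x1)), min(grid.shape[1], int(x2))
        y1, y2 = max(0, int(y1)), min(grid.shape[0], int(y2))
        region = grid[y1:y2, x1:x2]
        # Degenerate or off-screen boxes receive zero reward.
        rewards.append(float(region.mean() / peak) if region.size and peak else 0.0)
    return rewards  # used as the per-sample reward for the GRPO group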

Results

We apply our methods to a diverse set of VLMs to demonstrate their generality across different architectures and training paradigms, and evaluate them on three mainstream GUI grounding benchmarks: ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro. For GUI-RC, we sample 64 outputs for voting using a temperature of 0.5 and a top_p of 0.95, and set the hyperparameter α to 50. For the baselines, we employ greedy decoding with a temperature of 0.
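
For illustration only, the GUI-RC sampling setup can be written as a vLLM SamplingParams object; the text does not specify the serving stack, and max_tokens is an assumed budget for a short coordinate or bbox answer.

from vllm import SamplingParams

gui_rc_sampling = SamplingParams(
    n=64,             # K sampled predictions per instruction
    temperature=0.5,  # temperature-based sampling for diversity
    top_p=0.95,
    max_tokens=128,   # assumed: enough for a coordinate/bbox answer
)

greedy_baseline = SamplingParams(n=1, temperature=0.0, max_tokens=128)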

GUI-RC results

For GUI-RCPO, we adopt the VLM-R1 framework and conduct TTRL training on the ScreenSpot-v2 benchmark without using any ground-truth labels. For each input, 16 samples are generated with a temperature of 0.7 and a top_p of 0.95. We train the models for 2 epochs (approx. 40 steps) with a global batch size of 64, a learning rate of 1e-6, and a KL penalty β of 0.04.
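
The same hyperparameters, collected into a plain dictionary for reference; the key names are illustrative and may differ from the actual VLM-R1 / GRPO configuration schema.

gui_rcpo_config = {
    "num_generations": 16,       # samples per input used for the consistency reward
    "temperature": 0.7,
    "top_p": 0.95,
    "num_train_epochs": 2,       # roughly 40 optimization steps
    "global_batch_size": 64,
    "learning_rate": 1e-6,
    "kl_beta": 0.04,             # KL penalty coefficient β
    "uses_ground_truth": False,  # rewards come from region consistency only
}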

GUI-RCPO results

The results show that our methods consistently improve overall grounding capability across different models, regardless of their output style and of whether the model was specifically trained for GUI tasks. GUI-RC improves accuracy by 2-3% on average, while GUI-RCPO achieves further gains of 4-5% on average through label-free optimization. Additionally, our methods yield larger improvements when applied to models that produce bbox-style predictions.

BibTeX

@misc{du2025testtimereinforcementlearninggui,
  title={Test-Time Reinforcement Learning for GUI Grounding via Region Consistency},
  author={Yong Du and Yuchen Yan and Fei Tang and Zhengxi Lu and Chang Zong and Weiming Lu and Shengpei Jiang and Yongliang Shen},
  year={2025},
  eprint={2508.05615},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.05615}
}