
GUI-RCPO: Test-Time Reinforcement Learning for GUI Grounding via Region Consistency

Yong Du1,2,*, Yuchen Yan1,*, Fei Tang1, Zhengxi Lu1, Chang Zong3,
Weiming Lu1, Shengpei Jiang4, Yongliang Shen1,†
1Zhejiang University   2Central South University   3Zhejiang University of Science and Technology   4SF Technology

*Equal Contribution, †Corresponding Author
GUI-RCPO Framework Overview

Overview of our test-time scaling methods for GUI grounding. GUI-RC aggregates K sampled predictions through spatial voting to extract a consensus region, achieving more accurate localization than greedy decoding. GUI-RCPO computes region consistency rewards based on the voting heatmap and uses these self-supervised signals to update model parameters, enabling label-free improvement through test-time reinforcement learning.

Abstract

Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.

Methods

GUI-RC enhances GUI grounding by turning prediction uncertainty into reliable consensus through spatial aggregation. It first generates multiple predictions for the same input via temperature-based sampling, which naturally introduces diversity because the model outputs coordinates in a continuous space. These predictions are then mapped onto a spatial voting grid, where each predicted region casts a vote on every grid cell it covers. The grid thus records how often each location is selected across samples, and the final consensus region is extracted as the largest contiguous area with the highest vote count, representing the most consistent and confident localization.
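
To make the voting and extraction step concrete, here is a minimal sketch, assuming bbox-style predictions given as (x1, y1, x2, y2) pixel coordinates, a voting grid at pixel resolution, and SciPy for the connected-component step; it illustrates the idea rather than reproducing the released implementation.

import numpy as np
from scipy.ndimage import label

def consensus_region(pred_boxes, width, height):
    """Accumulate votes from K sampled boxes and return the consensus region."""
    grid = np.zeros((height, width), dtype=np.int32)
    for x1, y1, x2, y2 in pred_boxes:
        # Clip each predicted box to the screen and cast one vote per covered cell.
        x1, x2 = max(0, int(x1)), min(width, int(x2))
        y1, y2 = max(0, int(y1)), min(height, int(y2))
        grid[y1:y2, x1:x2] += 1

    # Keep only the cells that received the maximum number of votes.
    peak_mask = grid == grid.max()

    # Among the peak cells, take the largest contiguous area (4-connectivity)
    # and return its bounding box as the consensus region.
    labels, n = label(peak_mask)
    sizes = np.bincount(labels.ravel())[1:]           # component sizes, background excluded
    ys, xs = np.where(labels == (np.argmax(sizes) + 1))
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1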

GUI-RCPO builds on GUI-RC by using region consistency as a self-supervised reward signal for test-time reinforcement learning. For each sampled prediction, it computes a reward proportional to the average vote density within the predicted region, encouraging alignment with high-consistency areas. The model is then optimized using Group Relative Policy Optimization (GRPO). Over successive updates, the model increasingly focuses on consistent, high-confidence regions, enabling progressive self-improvement without relying on labeled data.
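
The reward can be sketched in a few lines on top of the voting grid above; normalizing by the peak vote count so that rewards fall in [0, 1] is our assumption here, since the text only states that the reward is proportional to the average vote density inside the predicted region.

import numpy as np

def region_consistency_rewards(pred_boxes, grid):
    """Score each sampled box by the mean normalized vote count it covers."""
    peak = grid.max()
    rewards = []
    for x1, y1, x2, y2 in pred_boxes:
        x1, x2 = max(0, int(x1)), min(grid.shape[1], int(x2))
        y1, y2 = max(0, int(y1)), min(grid.shape[0], int(y2))
        region = grid[y1:y2, x1:x2]
        # Degenerate or off-screen boxes receive zero reward.
        rewards.append(float(region.mean() / peak) if region.size and peak else 0.0)
    return rewards  # used as the per-sample reward for the GRPO group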

Results

We apply our methods to a diverse set of VLMs to demonstrate their generality across different architectures and training paradigms, and evaluate them on three mainstream GUI grounding benchmarks: ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro. For GUI-RC, we sample 64 outputs for voting using a temperature of 0.5 and a top_p of 0.95, and set the hyperparameter α to 50. For the baselines, we employ greedy decoding with a temperature of 0.
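
For illustration only, the GUI-RC sampling setup can be written as a vLLM SamplingParams object; the text does not specify the serving stack, and max_tokens is an assumed budget for a short coordinate or bbox answer.

from vllm import SamplingParams

gui_rc_sampling = SamplingParams(
    n=64,             # K sampled predictions per instruction
    temperature=0.5,  # temperature-based sampling for diversity
    top_p=0.95,
    max_tokens=128,   # assumed: enough for a coordinate/bbox answer
)

greedy_baseline = SamplingParams(n=1, temperature=0.0, max_tokens=128)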

GUI-RC results

For GUI-RCPO, we adopt the VLM-R1 framework and conduct TTRL training on the ScreenSpot-v2 benchmark without using any ground-truth labels. For each input, 16 samples are generated with a temperature of 0.7 and a top_p of 0.95. We train the models for 2 epochs (approx. 40 steps) with a global batch size of 64, a learning rate of 1e-6, and a KL penalty β of 0.04.
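
The same hyperparameters, collected into a plain dictionary for reference; the key names are illustrative and may differ from the actual VLM-R1 / GRPO configuration schema.

gui_rcpo_config = {
    "num_generations": 16,       # samples per input used for the consistency reward
    "temperature": 0.7,
    "top_p": 0.95,
    "num_train_epochs": 2,       # roughly 40 optimization steps
    "global_batch_size": 64,
    "learning_rate": 1e-6,
    "kl_beta": 0.04,             # KL penalty coefficient β
    "uses_ground_truth": False,  # rewards come from region consistency only
}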

GUI-RCPO results

The results show that our methods consistently improve overall grounding capability across different models, regardless of their output style and of whether the model was specifically trained for GUI tasks. GUI-RC improves accuracy by 2-3% on average, while GUI-RCPO achieves further gains of 4-5% on average through label-free optimization. Additionally, our methods yield larger improvements when applied to models that produce bbox-style predictions.

BibTeX

@misc{du2025testtimereinforcementlearninggui,
  title={Test-Time Reinforcement Learning for GUI Grounding via Region Consistency},
  author={Yong Du and Yuchen Yan and Fei Tang and Zhengxi Lu and Chang Zong and Weiming Lu and Shengpei Jiang and Yongliang Shen},
  year={2025},
  eprint={2508.05615},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.05615}
}