VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models

Yuchen Yan1,2,*, Jin Jiang2,3, Zhenbang Ren1,4, Yijun Li5, Xudong Cai1,5, Yang Liu2,
Xin Xu6, Mengdi Zhang2, Jian Shao1,†, Yongliang Shen1,†, Jun Xiao1, Yueting Zhuang1

1Zhejiang University 2Meituan Group 3Peking University
4University of Electronic Science and Technology of China
5Beijing University of Posts and Telecommunications
6The Hong Kong University of Science and Technology

Preprint. Under review.
*Work done during internship at Meituan Group. †Corresponding author.
Code | arXiv | 🤗 Dataset

The core distinction between existing reward benchmarks and VerifyBench & VerifyBench-Hard.

Abstract

Large reasoning models such as OpenAI o1 and DeepSeek-R1 have demonstrated remarkable performance in complex reasoning tasks. A critical component of their training is the incorporation of reference-based reward systems within reinforcement learning (RL), where model outputs are evaluated against ground truth references. However, existing reward benchmarks focus on preference comparisons between responses rather than evaluating verification against ground truth references, leaving a critical gap in our ability to evaluate verification systems used in reasoning model training. In this paper, we introduce VerifyBench and its challenging variant VerifyBench-Hard, two benchmarks specifically designed to assess reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Our comprehensive evaluation reveals that while larger model-based verifiers show promise on standard cases, all current systems demonstrate substantial room for improvement on challenging instances. Through systematic analysis of performance patterns across reasoning tasks and error categories, we provide insights for advancing reference-based reward systems. These benchmarks establish a standardized framework for improving verification accuracy, ultimately enhancing reasoning capabilities in models trained via RL.

Benchmark Construction Pipeline


Overview of the benchmark construction process. The upper section outlines the pipeline used to construct VerifyBench, whereas the lower section details the pipeline for VerifyBench-Hard. The components highlighted by black boxes denote the final entries included in the benchmark.

Benchmark Examples

Overall Performance

We evaluate various verification approaches on both VerifyBench and VerifyBench-Hard. For the rule-based baseline, we adopt the widely used math-verify package. In the LLM-as-a-judge setting, we prompt LLMs to verify model responses against the ground truth references. A minimal sketch of both settings is shown below.
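The sketch below illustrates the two settings under stated assumptions: the rule-based path uses the math-verify package's public `parse`/`verify` API, while the judge prompt, the `gpt-4o` model name, and the OpenAI-style client are illustrative placeholders, not the exact prompt or models evaluated in the paper.

```python
# Hedged sketch of the two verification settings described above.
# Assumptions: math_verify exposes parse()/verify(); the judge prompt and
# model name are hypothetical and only illustrate the LLM-as-a-judge setup.
from math_verify import parse, verify  # rule-based baseline (math-verify package)


def rule_based_verify(reference: str, completion: str) -> bool:
    """Return True if the completion's final answer matches the reference answer."""
    gold = parse(reference)
    pred = parse(completion)
    return verify(gold, pred)


JUDGE_PROMPT = """You are a strict verifier.
Question: {question}
Reference answer: {reference}
Candidate response: {completion}
Does the candidate response reach the same final answer as the reference?
Reply with exactly "correct" or "incorrect"."""


def llm_as_judge_verify(client, question: str, reference: str, completion: str,
                        model: str = "gpt-4o") -> bool:
    """Prompt an LLM (OpenAI-style chat client assumed) to judge correctness."""
    msg = JUDGE_PROMPT.format(question=question, reference=reference, completion=completion)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": msg}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("correct")
```

In practice, the rule-based verifier covers answer types that can be parsed and compared symbolically (e.g., numeric values and expressions), while the LLM judge is needed when correctness depends on understanding the question and reference in context.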


Overall performance (%) on VerifyBench and VerifyBench-Hard. Num stands for Numeric Values, Exp for Expressions, MC for Multiple-choice, and Str for String.

BibTeX

@misc{yan2025verifybench,
      title={VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models}, 
      author={Yuchen Yan and Jin Jiang and Zhenbang Ren and Yijun Li and Xudong Cai and Yang Liu and Xin Xu and Mengdi Zhang and Jian Shao and Yongliang Shen and Jun Xiao and Yueting Zhuang},
      year={2025},
      eprint={2505.15801},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.15801}, 
  }