Methods
Our method consists of two main components. The first is a pipeline for constructing a reference-based reward model, VerifyRM, covering data collection and annotation strategies as well as the reward model's training procedure. The second is Cooper, a reinforcement learning algorithm that co-optimizes the policy model and the reward model: the RM trained in the first stage guides the policy model's updates within Cooper while itself being updated concurrently.
🎯 Training Recipe of VerifyRM
Most existing reward models score input-output pairs directly. In reasoning tasks, however, a reference answer is typically available. We therefore construct a reference-based reward model that incorporates the reference answer into its input, improving scoring accuracy on reasoning tasks.
- Data Collection: 65K problem-reference-completion triples built from 7 mathematical reasoning datasets, with completions generated by 11 mainstream LLMs
- Hybrid Labeling: An automated approach that combines a rule-based verifier (Math-verify) with an LLM-as-a-judge (Qwen3-4B), yielding 58.7K high-quality labeled examples
- Model Training: A text classifier that takes the reference answer as part of its input, trained with a binary cross-entropy loss (a minimal sketch follows this list)
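As a rough illustration of this recipe (the base model name, input template, and training-loop details below are assumptions for the sketch, not the paper's exact configuration), the reward model can be treated as a sequence classifier whose input concatenates the problem, the reference answer, and the candidate completion, trained with binary cross-entropy against the hybrid labels:

```python
import torch
from torch.nn import BCEWithLogitsLoss
from transformers import AutoTokenizer, AutoModelForSequenceClassification

BASE_MODEL = "Qwen/Qwen2.5-1.5B"  # illustrative base model, not necessarily the one used for VerifyRM

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=1)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id


def build_input(problem: str, reference: str, completion: str) -> str:
    # Reference-based input: the reference answer is visible to the classifier
    # alongside the candidate completion. The exact template is an assumption.
    return (
        f"Problem:\n{problem}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Candidate completion:\n{completion}"
    )


def training_step(batch, optimizer):
    """One binary cross-entropy step on labeled triples.

    batch["label"] is 1.0 when the hybrid pipeline (rule-based verifier plus
    LLM-as-a-judge) marked the completion correct, and 0.0 otherwise.
    """
    texts = [build_input(p, r, c) for p, r, c in
             zip(batch["problem"], batch["reference"], batch["completion"])]
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    labels = torch.tensor(batch["label"], dtype=torch.float32)

    logits = model(**enc).logits.squeeze(-1)    # one correctness logit per triple
    loss = BCEWithLogitsLoss()(logits, labels)  # binary cross-entropy objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```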
🔄 Reinforcement Learning with Cooper
Cooper tunes both the policy model and the reward model within a single training step, proceeding in two stages:
Stage 1: Policy Model Optimization
- Sample responses using policy πθ
- Score each sampled response with the reward model, which takes the reference answer as part of its input
- Compute advantage estimates and update the policy with a policy-gradient step
- Apply a KL-divergence penalty for training stability (a sketch of this stage follows the list)
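A minimal sketch of the Stage-1 objective is shown below. The group-normalized advantage and the single sequence-level log-probability per response are simplifying assumptions; the section only states that advantages are estimated from reward-model scores and that a KL penalty is added for stability:

```python
import torch

def stage1_policy_loss(logprobs_new, logprobs_old, logprobs_ref, rewards, kl_coef=0.05):
    """Policy objective for one prompt with G sampled responses.

    rewards      : (G,) scores from the reward model, which saw the reference
                   answer as part of its input.
    logprobs_new : (G,) sequence log-probs under the current policy (with grad).
    logprobs_old : (G,) sequence log-probs under the sampling policy (detached).
    logprobs_ref : (G,) sequence log-probs under a frozen reference policy (detached).
    """
    # Advantage estimate: reward centered and scaled within the sampled group
    # (a group-relative baseline, assumed here for concreteness).
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Importance-weighted policy-gradient term.
    ratio = torch.exp(logprobs_new - logprobs_old)
    pg_loss = -(ratio * advantages).mean()

    # KL-divergence penalty toward the reference policy for training stability.
    kl_penalty = (logprobs_new - logprobs_ref).mean()

    return pg_loss + kl_coef * kl_penalty
```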
Stage 2: Reward Model Optimization
- Positive Sample Selection: Exploit the high precision of rule-based rewards to identify responses that are reliably correct
- Negative Sample Generation: An assistant LLM rewrites correct reasoning traces so that they end in incorrect answers
- Contrastive Optimization: Train the reward model to maximize the score gap between positive and negative samples (a sketch follows this list)
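One common way to realize this "maximize the score gap" objective is a pairwise Bradley-Terry style loss over (positive, negative) pairs; the sketch below assumes that form, although Cooper's exact contrastive loss may differ:

```python
import torch
import torch.nn.functional as F

def stage2_reward_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise contrastive loss for the Stage-2 reward-model update.

    pos_scores: reward-model scores for responses accepted by the rule-based
                verifier (relying on its high precision).
    neg_scores: reward-model scores for negatives produced by the assistant LLM,
                which rewrites a correct reasoning trace to end in a wrong answer.
    """
    # -log sigmoid(pos - neg): minimized by pushing positive scores above negative ones.
    return -F.logsigmoid(pos_scores - neg_scores).mean()

# Example: three preference pairs harvested during one RL step.
pos = torch.tensor([0.8, 0.6, 0.9])
neg = torch.tensor([0.7, 0.1, 0.4])
loss = stage2_reward_loss(pos, neg)
```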
💡 Key Innovation: Dynamic Data Strategy
Cooper's dynamic data strategy exploits the high precision of rule-based rewards to select positive samples while using an assistant LLM to generate challenging negative samples. This continuous stream of high-quality preference pairs lets the reward model keep improving throughout training, making it more robust and more resistant to reward hacking.