Methods
Our method consists of two main components. The first is a pipeline for constructing a reference-based reward model, VerifyRM, covering data collection and annotation strategies as well as the reward model's training procedure. The second is Cooper, a reinforcement learning algorithm that co-optimizes the policy model and the reward model: the RM trained in the first stage guides the policy model's updates within Cooper while itself being updated concurrently.
🎯 Training Recipe of VerifyRM
Most existing reward models score input-output pairs directly. In reasoning tasks, however, a reference answer is typically available. We therefore construct a reference-based reward model that incorporates the reference answer into its input, improving scoring accuracy on reasoning tasks.
- Data Collection: 65K problem-reference-completion triples built from 7 mathematical reasoning datasets, with completions generated by 11 mainstream LLMs
- Hybrid Labeling: An automated approach that combines a rule-based verifier (Math-verify) with an LLM-as-a-judge (Qwen3-4B), yielding 58.7K high-quality labeled examples
- Model Training: A text classifier that takes the reference answer as part of its input, trained with a binary cross-entropy loss (a minimal sketch follows this list)
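As a rough illustration of this recipe (the base model name, input template, and training-loop details below are assumptions for the sketch, not the paper's exact configuration), the reward model can be treated as a sequence classifier whose input concatenates the problem, the reference answer, and the candidate completion, trained with binary cross-entropy against the hybrid labels:

```python
import torch
from torch.nn import BCEWithLogitsLoss
from transformers import AutoTokenizer, AutoModelForSequenceClassification

BASE_MODEL = "Qwen/Qwen2.5-1.5B"  # illustrative base model, not necessarily the one used for VerifyRM

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=1)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id


def build_input(problem: str, reference: str, completion: str) -> str:
    # Reference-based input: the reference answer is visible to the classifier
    # alongside the candidate completion. The exact template is an assumption.
    return (
        f"Problem:\n{problem}\n\n"
        f"Reference answer:\n{reference}\n\n"
        f"Candidate completion:\n{completion}"
    )


def training_step(batch, optimizer):
    """One binary cross-entropy step on labeled triples.

    batch["label"] is 1.0 when the hybrid pipeline (rule-based verifier plus
    LLM-as-a-judge) marked the completion correct, and 0.0 otherwise.
    """
    texts = [build_input(p, r, c) for p, r, c in
             zip(batch["problem"], batch["reference"], batch["completion"])]
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    labels = torch.tensor(batch["label"], dtype=torch.float32)

    logits = model(**enc).logits.squeeze(-1)    # one correctness logit per triple
    loss = BCEWithLogitsLoss()(logits, labels)  # binary cross-entropy objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```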
🔄 Reinforcement Learning with Cooper
Cooper tunes both the policy model and the reward model within a single training step, proceeding in two stages:
Stage 1: Policy Model Optimization
- Sample responses using policy πθ
- Score each sampled response with the reward model, which takes the reference answer as part of its input
- Compute advantage estimates and update the policy with a policy-gradient step
- Apply a KL-divergence penalty for training stability (a sketch of this stage follows the list)
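A minimal sketch of the Stage-1 objective is shown below. The group-normalized advantage and the single sequence-level log-probability per response are simplifying assumptions; the section only states that advantages are estimated from reward-model scores and that a KL penalty is added for stability:

```python
import torch

def stage1_policy_loss(logprobs_new, logprobs_old, logprobs_ref, rewards, kl_coef=0.05):
    """Policy objective for one prompt with G sampled responses.

    rewards      : (G,) scores from the reward model, which saw the reference
                   answer as part of its input.
    logprobs_new : (G,) sequence log-probs under the current policy (with grad).
    logprobs_old : (G,) sequence log-probs under the sampling policy (detached).
    logprobs_ref : (G,) sequence log-probs under a frozen reference policy (detached).
    """
    # Advantage estimate: reward centered and scaled within the sampled group
    # (a group-relative baseline, assumed here for concreteness).
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Importance-weighted policy-gradient term.
    ratio = torch.exp(logprobs_new - logprobs_old)
    pg_loss = -(ratio * advantages).mean()

    # KL-divergence penalty toward the reference policy for training stability.
    kl_penalty = (logprobs_new - logprobs_ref).mean()

    return pg_loss + kl_coef * kl_penalty
```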
Stage 2: Reward Model Optimization
- Positive Sample Selection: Exploit the high precision of rule-based rewards to identify responses that are reliably correct
- Negative Sample Generation: An assistant LLM rewrites correct reasoning traces so that they end in incorrect answers
- Contrastive Optimization: Train the reward model to maximize the score gap between positive and negative samples (a sketch follows this list)
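One common way to realize this "maximize the score gap" objective is a pairwise Bradley-Terry style loss over (positive, negative) pairs; the sketch below assumes that form, although Cooper's exact contrastive loss may differ:

```python
import torch
import torch.nn.functional as F

def stage2_reward_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise contrastive loss for the Stage-2 reward-model update.

    pos_scores: reward-model scores for responses accepted by the rule-based
                verifier (relying on its high precision).
    neg_scores: reward-model scores for negatives produced by the assistant LLM,
                which rewrites a correct reasoning trace to end in a wrong answer.
    """
    # -log sigmoid(pos - neg): minimized by pushing positive scores above negative ones.
    return -F.logsigmoid(pos_scores - neg_scores).mean()

# Example: three preference pairs harvested during one RL step.
pos = torch.tensor([0.8, 0.6, 0.9])
neg = torch.tensor([0.7, 0.1, 0.4])
loss = stage2_reward_loss(pos, neg)
```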
💡 Key Innovation: Dynamic Data Strategy
Cooper's dynamic data strategy exploits the high precision of rule-based rewards to select positive samples while using an assistant LLM to generate challenging negative samples. This continuous stream of high-quality preference pairs lets the reward model keep improving throughout training, making it more robust and more resistant to reward hacking.