Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning

Aozhe Wang1,*, Yuchen Yan1,*, Nan Zhou1,*, Zhengxi Lu1,*,
Weiming Lu1, Jun Xiao1, Yueting Zhuang1, Yongliang Shen1,†
1Zhejiang University
*Equal contribution, †Corresponding author

Abstract

Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability.

Introduction

Introduction figure for Code-A1

The core motivation of Code-A1 is that code RL needs rewards that evolve with the model. Static golden tests are sparse and quickly become saturated, while self-play with a single model creates a conflict between writing code and judging it.

Code-A1 resolves this by splitting the roles: one model writes code, the other attacks it with tests. This turns white-box access into an advantage rather than a reward-hacking risk, because the Test LLM can inspect the candidate program and generate implementation-specific adversarial cases.

Method

Code-A1 jointly trains a Code LLM and a Test LLM with opposite objectives. For each question, the Code LLM samples candidate solutions and the Test LLM generates multiple test suites conditioned on both the question and the candidate code. Generated tests are validated against ground-truth code before being used for reward computation.
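The validation step above can be sketched in a few lines. This is a hypothetical illustration, not the paper's sandbox interface: a generated suite is kept for reward computation only if the ground-truth solution passes every case in it. All function names and the `(args, expected)` case format are assumptions.

```python
# Illustrative validation filter: generated test suites are checked against
# the ground-truth program before they are allowed to produce rewards.

def run_case(solution, case):
    """Execute one (args, expected) pair against a candidate solution."""
    args, expected = case
    return solution(*args) == expected

def validate_suite(ground_truth, suite):
    """A suite is valid iff the ground truth passes all of its cases."""
    return all(run_case(ground_truth, c) for c in suite)

# Toy example with abs() standing in for the ground-truth program.
good_suite = [((3,), 3), ((-2,), 2)]
bad_suite = [((3,), 3), ((-2,), -2)]   # second expected output is wrong

assert validate_suite(abs, good_suite)
assert not validate_suite(abs, bad_suite)
```

Filtering against the ground truth is what makes white-box test generation safe: a suite that merely disagrees with the reference solution is discarded rather than rewarded.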

Method figure for Code-A1

Adversarial Rollout

The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects on those same solutions.

Mistake Book

Historical failures are replayed as a stable baseline so the Code LLM does not forget earlier weaknesses during later training.
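An illustrative replay buffer for this mechanism: tests that previously broke the Code LLM are stored per question and mixed back into later rollouts, so old weaknesses keep being re-checked. The class name and the uniform sampling policy are assumptions, not the paper's implementation.

```python
# Per-question store of historical failing tests, sampled back into rollouts.

import random
from collections import defaultdict

class MistakeBook:
    def __init__(self, seed=0):
        self._failures = defaultdict(list)   # question id -> failing cases
        self._rng = random.Random(seed)

    def record(self, question_id, failing_cases):
        """Remember cases that exposed a defect in a candidate solution."""
        self._failures[question_id].extend(failing_cases)

    def sample(self, question_id, k):
        """Draw up to k historical failures for replay in the next rollout."""
        pool = self._failures[question_id]
        return self._rng.sample(pool, min(k, len(pool)))

book = MistakeBook()
book.record("q1", [((-2,), 2)])
book.record("q1", [((0,), 0)])
assert len(book.sample("q1", k=5)) == 2   # everything recorded so far
```

Because replayed cases come from a fixed store rather than the current Test LLM, they act as the stable baseline the text describes.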

Composite Reward

The Test LLM balances test validity and adversarial difficulty via the composite reward $$R_T(\hat{T}) = \alpha R_T^{val} + (1-\alpha) R_T^{adv},$$ where $\alpha \in [0,1]$ weights validity against difficulty.
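As a direct transcription of this formula (the paper's actual value of $\alpha$ is not restated here, so the default below is a placeholder):

```python
# Composite Test LLM reward: convex combination of validity and difficulty.

def composite_test_reward(r_val, r_adv, alpha=0.5):
    """R_T = alpha * R_val + (1 - alpha) * R_adv, with alpha in [0, 1]."""
    assert 0.0 <= alpha <= 1.0
    return alpha * r_val + (1.0 - alpha) * r_adv

# A fully valid but non-adversarial suite is capped at alpha:
assert composite_test_reward(1.0, 0.0) == 0.5
# Mixed validity and difficulty interpolate linearly:
assert composite_test_reward(0.8, 0.6) == 0.7
```

The validity term keeps the adversarial pressure from collapsing into invalid or unstable tests, which is the instability failure mode discussed below.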

This design directly addresses the main failure modes discussed in the paper: black-box tests are too generic, pure adversarial rewards can become unstable, and a model trained only on current failures tends to forget resolved bugs. Code-A1 combines white-box testing, reward shaping, and replay to keep the co-evolution useful throughout training.

Detailed Pipeline

At the systems level, the framework separates Code LLM training and Test LLM training while connecting them through sandbox execution, validation, and replay retrieval. The pipeline below shows where candidate code is generated, where tests are validated, and where Mistake Book updates are applied.

Detailed training pipeline for Code-A1

Main Results

Code-A1 improves both sides of the training game. On code generation, it reaches the best average score across all three Qwen2.5-Coder scales. On test generation, the gains are even more striking: the 3B Test LLM trained with Code-A1 reaches a Mul score of 15.29, surpassing the 14.72 achieved by the unoptimized 7B base model.

Code generation results:

Code LLM             Method         HumanEval+   MBPP+   BigCodeBench   Avg
Qwen2.5-Coder-1.5B   Base           63.42        60.87   29.34          51.21
                     Golden Tests   71.15        63.30   34.23          56.23
                     Self-Play      70.64        63.54   33.47          55.88
                     Code-A1        72.69        63.33   34.82          56.95
Qwen2.5-Coder-3B     Base           77.63        63.12   41.78          60.84
                     Golden Tests   81.96        68.05   45.41          65.14
                     Self-Play      81.86        67.06   45.09          64.67
                     Code-A1        83.52        69.07   45.85          66.15
Qwen2.5-Coder-7B     Base           83.69        71.95   49.41          68.35
                     Golden Tests   84.68        74.16   52.28          70.37
                     Self-Play      84.70        74.23   52.25          70.39
                     Code-A1        85.21        74.50   52.46          70.72

Test generation results:

Test LLM             Method      pass@5   mut@5   Mul
Qwen2.5-Coder-1.5B   Base        16.29    22.30    3.63
                     SFT         14.76    29.45    4.35
                     Self-Play   23.39    28.91    6.76
                     Code-A1     27.05    26.41    7.14
Qwen2.5-Coder-3B     Base        20.93    42.55    8.91
                     SFT         23.51    36.29    8.53
                     Self-Play   29.64    50.92   15.09
                     Code-A1     30.86    49.56   15.29
Qwen2.5-Coder-7B     Base        28.73    51.25   14.72
                     SFT         28.72    50.85   14.60
                     Self-Play   35.13    55.57   19.52
                     Code-A1     37.15    53.14   19.74

Analysis

BibTeX

@misc{wang2026codea1adversarialevolvingcode,
      title={Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning}, 
      author={Aozhe Wang and Yuchen Yan and Nan Zhou and Zhengxi Lu and Weiming Lu and Jun Xiao and Yueting Zhuang and Yongliang Shen},
      year={2026},
      eprint={2603.15611},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.15611}, 
}