Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning

Aozhe Wang1,*, Yuchen Yan1,*, Nan Zhou1,*, Zhengxi Lu1,*,
Weiming Lu1, Jun Xiao1, Yueting Zhuang1, Yongliang Shen1,†
1Zhejiang University
*Equal contribution, †Corresponding author

Abstract

Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability.

Introduction

Introduction figure for Code-A1

The core motivation of Code-A1 is that code RL needs rewards that evolve with the model. Static golden tests are sparse and quickly become saturated, while self-play with a single model creates a conflict between writing code and judging it.

Code-A1 resolves this by splitting the roles: one model writes code, the other attacks it with tests. This turns white-box access into an advantage rather than a reward-hacking risk, because the Test LLM can inspect the candidate program and generate implementation-specific adversarial cases.

Method

Code-A1 jointly trains a Code LLM and a Test LLM with opposite objectives. For each question, the Code LLM samples candidate solutions and the Test LLM generates multiple test suites conditioned on both the question and the candidate code. Generated tests are validated against ground-truth code before being used for reward computation.
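The validation step above can be sketched in a few lines. This is a hypothetical illustration, not the paper's sandbox interface: a generated suite is kept for reward computation only if the ground-truth solution passes every case in it. All function names and the `(args, expected)` case format are assumptions.

```python
# Illustrative validation filter: generated test suites are checked against
# the ground-truth program before they are allowed to produce rewards.

def run_case(solution, case):
    """Execute one (args, expected) pair against a candidate solution."""
    args, expected = case
    return solution(*args) == expected

def validate_suite(ground_truth, suite):
    """A suite is valid iff the ground truth passes all of its cases."""
    return all(run_case(ground_truth, c) for c in suite)

# Toy example with abs() standing in for the ground-truth program.
good_suite = [((3,), 3), ((-2,), 2)]
bad_suite = [((3,), 3), ((-2,), -2)]   # second expected output is wrong

assert validate_suite(abs, good_suite)
assert not validate_suite(abs, bad_suite)
```

Filtering against the ground truth is what makes white-box test generation safe: a suite that merely disagrees with the reference solution is discarded rather than rewarded.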

Method figure for Code-A1

Adversarial Rollout

The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects on those same solutions.

Mistake Book

Historical failures are replayed as a stable baseline so the Code LLM does not forget earlier weaknesses during later training.
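An illustrative replay buffer for this mechanism: tests that previously broke the Code LLM are stored per question and mixed back into later rollouts, so old weaknesses keep being re-checked. The class name and the uniform sampling policy are assumptions, not the paper's implementation.

```python
# Per-question store of historical failing tests, sampled back into rollouts.

import random
from collections import defaultdict

class MistakeBook:
    def __init__(self, seed=0):
        self._failures = defaultdict(list)   # question id -> failing cases
        self._rng = random.Random(seed)

    def record(self, question_id, failing_cases):
        """Remember cases that exposed a defect in a candidate solution."""
        self._failures[question_id].extend(failing_cases)

    def sample(self, question_id, k):
        """Draw up to k historical failures for replay in the next rollout."""
        pool = self._failures[question_id]
        return self._rng.sample(pool, min(k, len(pool)))

book = MistakeBook()
book.record("q1", [((-2,), 2)])
book.record("q1", [((0,), 0)])
assert len(book.sample("q1", k=5)) == 2   # everything recorded so far
```

Because replayed cases come from a fixed store rather than the current Test LLM, they act as the stable baseline the text describes.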

Composite Reward

The Test LLM balances test validity and adversarial difficulty via the composite reward $$R_T(\hat{T}) = \alpha R_T^{val} + (1-\alpha) R_T^{adv},$$ where $\alpha \in [0,1]$ weights validity against difficulty.
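As a direct transcription of this formula (the paper's actual value of $\alpha$ is not restated here, so the default below is a placeholder):

```python
# Composite Test LLM reward: convex combination of validity and difficulty.

def composite_test_reward(r_val, r_adv, alpha=0.5):
    """R_T = alpha * R_val + (1 - alpha) * R_adv, with alpha in [0, 1]."""
    assert 0.0 <= alpha <= 1.0
    return alpha * r_val + (1.0 - alpha) * r_adv

# A fully valid but non-adversarial suite is capped at alpha:
assert composite_test_reward(1.0, 0.0) == 0.5
# Mixed validity and difficulty interpolate linearly:
assert composite_test_reward(0.8, 0.6) == 0.7
```

The validity term keeps the adversarial pressure from collapsing into invalid or unstable tests, which is the instability failure mode discussed below.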

This design directly addresses the main failure modes discussed in the paper: black-box tests are too generic, pure adversarial rewards can become unstable, and a model trained only on current failures tends to forget resolved bugs. Code-A1 combines white-box testing, reward shaping, and replay to keep the co-evolution useful throughout training.

Detailed Pipeline

At the systems level, the framework separates Code LLM training and Test LLM training while connecting them through sandbox execution, validation, and replay retrieval. The pipeline below shows where candidate code is generated, where tests are validated, and where Mistake Book updates are applied.

Detailed training pipeline for Code-A1

Main Results

Code-A1 improves both sides of the training game. On code generation, it reaches the best average score across all three Qwen2.5-Coder scales. On test generation, the gains are even more striking: the 3B Test LLM trained with Code-A1 reaches a Mul score of 15.29, surpassing the 14.72 achieved by the unoptimized 7B base model.

Code generation results:

Code LLM             Method         HumanEval+   MBPP+   BigCodeBench   Avg
Qwen2.5-Coder-1.5B   Base           63.42        60.87   29.34          51.21
                     Golden Tests   71.15        63.30   34.23          56.23
                     Self-Play      70.64        63.54   33.47          55.88
                     Code-A1        72.69        63.33   34.82          56.95
Qwen2.5-Coder-3B     Base           77.63        63.12   41.78          60.84
                     Golden Tests   81.96        68.05   45.41          65.14
                     Self-Play      81.86        67.06   45.09          64.67
                     Code-A1        83.52        69.07   45.85          66.15
Qwen2.5-Coder-7B     Base           83.69        71.95   49.41          68.35
                     Golden Tests   84.68        74.16   52.28          70.37
                     Self-Play      84.70        74.23   52.25          70.39
                     Code-A1        85.21        74.50   52.46          70.72

Test generation results:

Test LLM             Method      pass@5   mut@5   Mul
Qwen2.5-Coder-1.5B   Base        16.29    22.30    3.63
                     SFT         14.76    29.45    4.35
                     Self-Play   23.39    28.91    6.76
                     Code-A1     27.05    26.41    7.14
Qwen2.5-Coder-3B     Base        20.93    42.55    8.91
                     SFT         23.51    36.29    8.53
                     Self-Play   29.64    50.92   15.09
                     Code-A1     30.86    49.56   15.29
Qwen2.5-Coder-7B     Base        28.73    51.25   14.72
                     SFT         28.72    50.85   14.60
                     Self-Play   35.13    55.57   19.52
                     Code-A1     37.15    53.14   19.74

Analysis

BibTeX

@misc{wang2026codea1adversarialevolvingcode,
      title={Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning}, 
      author={Aozhe Wang and Yuchen Yan and Nan Zhou and Zhengxi Lu and Weiming Lu and Jun Xiao and Yueting Zhuang and Yongliang Shen},
      year={2026},
      eprint={2603.15611},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.15611}, 
}