AZR: Zero-Data Self-Play Reasoning for LLMs
- Methodology: AZR (Absolute Zero Reasoner) is an LLM training framework that learns to reason with no human-curated data, using self-play with verifiable feedback from a Python execution environment. A single LLM acts as both the proposer and the solver of tasks across three reasoning modes: deduction, abduction, and induction (see the verification sketch after this list).
- Key Finding: AZR achieves state-of-the-art (SOTA) coding and math reasoning in the zero-data Reinforcement Learning with Verifiable Rewards (RLVR) setting, outperforming previous zero-setting models by +1.8 points on average and even surpassing models trained on tens to hundreds of thousands of curated samples. The AZR-Coder-7B model achieves the highest average score across all tested models.
- Generalization: Training AZR in a coding-only environment improves mathematical reasoning performance by up to +15.2 points, a much larger gain than that of expert code models trained with RLVR, indicating strong cross-domain generalization.
- Scaling: Performance consistently improves with larger AZR models (3B → 7B → 14B), suggesting scalability.
- Emergent Behaviors: AZR exhibits ReAct-like intermediate planning in code (interleaved comments and logic; see the illustrative snippet after this list), trial-and-error strategies in abduction, and systematic state tracking, behaviors typically seen in much larger models.
- Safety Considerations: Llama-3.1-8B variants of AZR sometimes produce concerning reasoning chains ("uh-oh moments"), highlighting the need for safety-aware training in autonomous systems.
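
The verifiable-reward loop at the heart of this setup can be sketched in a few lines. The Python below is a minimal, hypothetical sketch, not AZR's released code: names like `run_program` and `verify` are our own, and a real verifier would add sandboxing, timeouts, and determinism checks. It shows the core idea that a solver's answer is checked by actually executing a proposed program, with one branch per reasoning mode.

```python
def run_program(program: str, inp) -> object:
    """Execute a proposed program string and return f(inp).
    WARNING: exec on untrusted code is unsafe; a real verifier sandboxes this."""
    namespace: dict = {}
    exec(program, namespace)          # the program is expected to define `f`
    return namespace["f"](inp)

def verify(mode: str, program, inp, out, answer) -> bool:
    """Binary verifiable reward for the solver; unused slots are passed as None."""
    if mode == "deduction":           # given (f, i), predict the output o
        return run_program(program, inp) == answer
    if mode == "abduction":           # given (f, o), infer an input i with f(i) == o
        return run_program(program, answer) == out
    if mode == "induction":           # given i/o pairs, synthesize the program f itself
        return all(run_program(answer, i) == o for i, o in zip(inp, out))
    raise ValueError(f"unknown mode: {mode}")

# Example triple: f(x) = x * x.
prog = "def f(x):\n    return x * x"
assert verify("deduction", prog, 3, None, 9)            # solver predicted f(3) == 9
assert verify("abduction", prog, None, 9, -3)           # solver proposed input -3 for output 9
assert verify("induction", None, [2, 3], [4, 9], prog)  # solver synthesized f from i/o pairs
```

Because the reward comes from execution rather than a learned judge, it is binary and grounded, which is what makes training without any curated data feasible.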
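To make the "interleaved comments and logic" behavior concrete, the snippet below is an illustrative reconstruction, not actual AZR output: in the ReAct spirit, each comment states an intermediate plan that the following line then executes.

```python
def f(nums):
    # Plan: scan once, tracking the best subarray sum seen so far (state tracking).
    best = float("-inf")
    current = 0
    for n in nums:
        # Act: either extend the current subarray or restart it at n.
        current = max(n, current + n)
        # Observe: fold the result into the running best.
        best = max(best, current)
    return best

assert f([-2, 1, -3, 4, -1, 2, 1, -5, 4]) == 6  # max subarray is [4, -1, 2, 1]
```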
Source: Absolute Zero: Reinforced Self-play Reasoning with Zero Data (Zhao et al., arXiv:2505.03335)