AZR: Zero-Data Self-Play Reasoning for LLMs
- Methodology: AZR (Absolute Zero Reasoner) is an LLM training framework that learns to reason with no human-curated data, using self-play with verifiable feedback from a Python execution environment. A single LLM acts as both the proposer and the solver of tasks across three reasoning modes: deduction, abduction, and induction (see the verification sketch after this list).
- Key Finding: AZR achieves state-of-the-art (SOTA) coding and math reasoning in the zero-data Reinforcement Learning with Verifiable Rewards (RLVR) setting, outperforming previous zero-setting models by +1.8 points on average and even surpassing models trained on tens to hundreds of thousands of curated samples. The AZR-Coder-7B model achieves the highest average score across all tested models.
- Generalization: Training AZR in a coding-only environment improves mathematical reasoning performance by up to +15.2 points, a much larger gain than that of expert code models trained with RLVR, indicating strong cross-domain generalization.
- Scaling: Performance consistently improves with larger AZR models (3B → 7B → 14B), suggesting scalability.
- Emergent Behaviors: AZR exhibits ReAct-like intermediate planning in code (interleaved comments and logic; see the illustrative snippet after this list), trial-and-error strategies in abduction, and systematic state tracking, behaviors typically seen in much larger models.
- Safety Considerations: Llama-3.1-8B variants of AZR sometimes produce concerning reasoning chains ("uh-oh moments"), highlighting the need for safety-aware training in autonomous systems.
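
The verifiable-reward loop at the heart of this setup can be sketched in a few lines. The Python below is a minimal, hypothetical sketch, not AZR's released code: names like `run_program` and `verify` are our own, and a real verifier would add sandboxing, timeouts, and determinism checks. It shows the core idea that a solver's answer is checked by actually executing a proposed program, with one branch per reasoning mode.

```python
def run_program(program: str, inp) -> object:
    """Execute a proposed program string and return f(inp).
    WARNING: exec on untrusted code is unsafe; a real verifier sandboxes this."""
    namespace: dict = {}
    exec(program, namespace)          # the program is expected to define `f`
    return namespace["f"](inp)

def verify(mode: str, program, inp, out, answer) -> bool:
    """Binary verifiable reward for the solver; unused slots are passed as None."""
    if mode == "deduction":           # given (f, i), predict the output o
        return run_program(program, inp) == answer
    if mode == "abduction":           # given (f, o), infer an input i with f(i) == o
        return run_program(program, answer) == out
    if mode == "induction":           # given i/o pairs, synthesize the program f itself
        return all(run_program(answer, i) == o for i, o in zip(inp, out))
    raise ValueError(f"unknown mode: {mode}")

# Example triple: f(x) = x * x.
prog = "def f(x):\n    return x * x"
assert verify("deduction", prog, 3, None, 9)            # solver predicted f(3) == 9
assert verify("abduction", prog, None, 9, -3)           # solver proposed input -3 for output 9
assert verify("induction", None, [2, 3], [4, 9], prog)  # solver synthesized f from i/o pairs
```

Because the reward comes from execution rather than a learned judge, it is binary and grounded, which is what makes training without any curated data feasible.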
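To make the "interleaved comments and logic" behavior concrete, the snippet below is an illustrative reconstruction, not actual AZR output: in the ReAct spirit, each comment states an intermediate plan that the following line then executes.

```python
def f(nums):
    # Plan: scan once, tracking the best subarray sum seen so far (state tracking).
    best = float("-inf")
    current = 0
    for n in nums:
        # Act: either extend the current subarray or restart it at n.
        current = max(n, current + n)
        # Observe: fold the result into the running best.
        best = max(best, current)
    return best

assert f([-2, 1, -3, 4, -1, 2, 1, -5, 4]) == 6  # max subarray is [4, -1, 2, 1]
```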
Source: Absolute Zero: Reinforced Self-play Reasoning with Zero Data (Zhao et al., arXiv:2505.03335)