Absolute Zero: Self-Learning AI for Reasoning

Methodology: Absolute Zero Reasoner (AZR) introduces a novel paradigm where a single LLM self-generates and solves reasoning tasks using a code executor for verifiable feedback, eliminating reliance on human-curated data. It combines inductive, abductive, and deductive code challenges to create a self-evolving curriculum.
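The three task types can be read as three ways of hiding one element of a (program, input, output) triplet and checking the answer by execution. The sketch below illustrates that reading. The helper names, the convention that each program defines a single-argument function `f`, and the use of a bare `exec` without sandboxing are assumptions made for illustration, not details taken from AZR.

```python
from typing import Any, List, Tuple


def run(program: str, arg: Any) -> Any:
    """Execute `program` (assumed to define a function `f`) and return f(arg)."""
    namespace: dict = {}
    exec(program, namespace)  # illustration only: no sandboxing is applied here
    return namespace["f"](arg)


def verify_deduction(program: str, arg: Any, predicted_output: Any) -> bool:
    """Deduction: given program and input, the solver predicts the output."""
    return run(program, arg) == predicted_output


def verify_abduction(program: str, predicted_input: Any, target_output: Any) -> bool:
    """Abduction: given program and output, the solver proposes an input.

    Any input that reproduces the target output is accepted.
    """
    return run(program, predicted_input) == target_output


def verify_induction(candidate_program: str, examples: List[Tuple[Any, Any]]) -> bool:
    """Induction: given input/output pairs, the solver writes a program."""
    return all(run(candidate_program, x) == y for x, y in examples)


# Hypothetical task program used only for this example.
PROGRAM = "def f(x):\n    return sorted(x)[::-1]\n"

deduction_ok = verify_deduction(PROGRAM, [2, 3, 1], [3, 2, 1])
abduction_ok = verify_abduction(PROGRAM, [1, 3, 2], [3, 2, 1])
induction_ok = verify_induction(
    "def f(x):\n    return sorted(x, reverse=True)\n",
    [([2, 3, 1], [3, 2, 1]), ([5, 4], [5, 4])],
)
```

In each case the executor, not a human label, decides whether the answer is correct, which is what makes the feedback verifiable.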
Implementation: A unified LLM acts as both proposer and solver, guided by a learnability reward to craft tasks of moderate complexity. Training is end-to-end using Task-Relative REINFORCE++. The code executor validates proposed code reasoning tasks and verifies answers.
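One way to make "tasks of moderate complexity" concrete is a proposer reward that is zero when the solver always or never succeeds, and otherwise grows as the task gets harder. The sketch below uses that form, together with a per-(task type, role) reward normalization as one plausible reading of "task-relative". Both formulas, and all function and parameter names, are assumptions for illustration. The summary above does not spell them out.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple


def learnability_reward(solve_outcomes: Sequence[float]) -> float:
    """Proposer reward from solver outcomes (1.0 = solved, 0.0 = failed).

    Assumed form: 0 for tasks that are trivially easy or impossible,
    otherwise 1 - mean solve rate.
    """
    if not solve_outcomes:
        raise ValueError("need at least one solver attempt")
    rate = mean(float(o) for o in solve_outcomes)
    if rate == 0.0 or rate == 1.0:
        return 0.0
    return 1.0 - rate


def task_relative_advantages(
    samples: List[Tuple[str, str, float]], eps: float = 1e-8
) -> List[float]:
    """Normalize each reward against the mean/std of its (task, role) group.

    `samples` holds (task_type, role, reward) triples. Grouping by both
    task type and role is an assumed interpretation of "task-relative".
    """
    groups: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for task, role, reward in samples:
        groups[(task, role)].append(reward)
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}
    return [
        (reward - stats[(task, role)][0]) / (stats[(task, role)][1] + eps)
        for task, role, reward in samples
    ]


# Hypothetical batch: two deduction solves and one induction solve.
batch = [
    ("deduction", "solve", 1.0),
    ("deduction", "solve", 0.0),
    ("induction", "solve", 0.5),
]
advantages = task_relative_advantages(batch)
```

Under these assumptions, a task solved one time in four earns the proposer 0.75, while tasks solved every time or never earn nothing, which pushes the proposer toward tasks at the edge of the solver's ability.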
Performance: AZR reports state-of-the-art results on coding and mathematical reasoning benchmarks, surpassing specialized models trained on human-curated datasets.
Generalization: AZR exhibits robust cross-domain transfer, and its gains grow with model scale, both of which point to strong generalization.
Ablation Studies: Performance drops significantly when induction is removed or when only deduction is used, highlighting the importance of task type diversity. Removing the conditioning on K references, or omitting proposer-role training, also degrades performance.
Limitations: To keep generated programs safe, the system restricts sensitive Python packages (e.g., `os.sys`, `subprocess`) and checks proposed programs for determinism. These filters do not rule out emergent undesirable behaviors, so safety-aware training is still needed.
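The two program checks mentioned here, package restriction and determinism, can be sketched as a static import scan followed by repeated execution. The forbidden-module set, the function names, the single-argument `f` convention, and the in-process `exec` are all assumptions for illustration. A real system would run untrusted code in an isolated sandbox and would likely use a longer restriction list.

```python
import ast
from typing import Any

# Illustrative subset only; not the actual restriction list used by AZR.
FORBIDDEN_MODULES = {"os", "sys", "subprocess"}


def imports_forbidden_module(source: str) -> bool:
    """Return True if `source` imports any top-level module in the forbidden set."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        if any(name.split(".")[0] in FORBIDDEN_MODULES for name in names):
            return True
    return False


def is_deterministic(source: str, arg: Any, runs: int = 2) -> bool:
    """Run the program `runs` times and compare the printed form of the outputs."""
    outputs = []
    for _ in range(runs):
        namespace: dict = {}
        exec(source, namespace)  # illustration only: not a sandbox
        outputs.append(repr(namespace["f"](arg)))
    return len(set(outputs)) == 1


def is_valid_task(source: str, arg: Any) -> bool:
    """Accept a proposed program only if it passes both checks without error."""
    try:
        if imports_forbidden_module(source):
            return False  # rejected before any code is executed
        return is_deterministic(source, arg)
    except Exception:
        return False  # syntax errors and runtime failures also reject the task


safe_ok = is_valid_task("def f(x):\n    return x * 2\n", 3)
unsafe_ok = is_valid_task("import subprocess\ndef f(x):\n    return x\n", 3)
random_ok = is_valid_task(
    "import random\ndef f(x):\n    return x + random.random()\n", 3
)
```

A static scan like this misses dynamic imports such as `__import__("os")`, and comparing a small number of runs can only detect nondeterminism, not prove its absence, which is consistent with the point that filtering alone does not settle the safety question.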