NemoTron-CrossThink: RL for Enhanced LLM Reasoning
- NemoTron-CrossThink is a novel RL framework that improves LLM reasoning by blending synthetic and open-source Q&A pairs across STEM, humanities, and social sciences. It addresses the challenge of verifiable rewards outside mathematics by casting data into structured templates (MCQ and open-ended) and filtering out samples whose answers cannot be checked with rule-based rewards.
- The framework curates a diverse dataset, _D = Dsyn ∪ Dos_, comprising synthetic data from Common Crawl and open-source QA datasets, applying templates (_Dmcq = TMCQ(Dgpr), Dopen = TOpen(Dgpr)_) and filtering steps (_H_) to ensure reward compatibility.
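A minimal sketch of this curation pipeline may clarify the notation. The template functions, the `is_verifiable` heuristic, and all field names below are hypothetical stand-ins for the paper's _TMCQ_, _TOpen_, and filtering steps _H_, not the actual implementation:

```python
# Hypothetical sketch of D = D_syn ∪ D_os curation: apply MCQ / open-ended
# templates, then keep only samples a rule-based reward can score.

def apply_mcq_template(q):
    # Stand-in for T_MCQ: render question plus lettered options.
    opts = "\n".join(f"({c}) {o}" for c, o in zip("ABCD", q["options"]))
    return {"prompt": f"{q['question']}\n{opts}\nAnswer with the option letter.",
            "answer": q["answer"]}

def apply_open_template(q):
    # Stand-in for T_Open: render question as a short-answer prompt.
    return {"prompt": f"{q['question']}\nGive a short final answer.",
            "answer": q["answer"]}

def is_verifiable(sample):
    # Stand-in for the filtering steps H: keep answers short and unambiguous
    # enough for exact-match scoring (heuristic, not the paper's exact rules).
    return len(sample["answer"].split()) <= 5

def build_dataset(general_purpose_qa, has_options):
    templated = [apply_mcq_template(q) if has_options(q) else apply_open_template(q)
                 for q in general_purpose_qa]
    return [s for s in templated if is_verifiable(s)]
```

Open-ended essay-style answers fall out at the filtering step, which is what makes the remaining data compatible with a binary accuracy reward.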
- Methodology: Employs Group Relative Policy Optimization (GRPO) with a rule-based reward function (_R = Racc ∧ Rformat_) that combines accuracy and formatting criteria. The GRPO objective function is defined as:
- ```
  J_{GRPO}(\theta) =
    \mathbb{E}_{q \sim P(Q),\; \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O|q)}
    \Bigg[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
      \min\Big(
        \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})} \hat{A}_{i,t},\;
        \mathrm{clip}\Big( \frac{\pi_\theta(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})},\, 1-\epsilon,\, 1+\epsilon \Big) \hat{A}_{i,t}
      \Big)
      - \beta\, D_{KL}\big[\pi_\theta \,\|\, \pi_{ref}\big]
    \Bigg],                                                            (1)

  D_{KL}\big[\pi_\theta \,\|\, \pi_{ref}\big] =
    \frac{\pi_{ref}(o_{i,t}|q, o_{i,<t})}{\pi_\theta(o_{i,t}|q, o_{i,<t})}
    - \log \frac{\pi_{ref}(o_{i,t}|q, o_{i,<t})}{\pi_\theta(o_{i,t}|q, o_{i,<t})} - 1.
  ```
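The objective above can be sketched numerically. The following is a minimal NumPy illustration, not the paper's veRL implementation: the boxed-answer convention in `rule_based_reward` and the exact reward/threshold choices are assumptions, while the clipped surrogate, the group-normalized advantages, and the K3-style KL estimator follow equation (1):

```python
import numpy as np

def rule_based_reward(answer: str, reference: str) -> float:
    """R = R_acc AND R_format: reward 1 only when the answer is both
    well-formatted (hypothetical \\boxed{...} convention) and correct."""
    well_formatted = answer.startswith("\\boxed{") and answer.endswith("}")
    accurate = well_formatted and answer[7:-1].strip() == reference.strip()
    return 1.0 if (well_formatted and accurate) else 0.0

def group_advantages(rewards):
    """GRPO advantage: normalize each reward by its group's mean and std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_objective(logp_new, logp_old, logp_ref, advantages,
                   eps=0.2, beta=0.001):
    """Eq. (1) for one group of G responses; each argument is a list of
    1-D per-token log-prob (or advantage) arrays, one per response o_i."""
    total, G = 0.0, len(logp_new)
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old,
                                           logp_ref, advantages):
        ratio = np.exp(lp_new - lp_old)          # pi_theta / pi_theta_old
        clipped = np.clip(ratio, 1 - eps, 1 + eps)
        surrogate = np.minimum(ratio * adv, clipped * adv)
        # K3 KL estimator: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
        log_r = lp_ref - lp_new
        kl = np.exp(log_r) - log_r - 1.0
        total += np.mean(surrogate - beta * kl)  # 1/|o_i| sum over tokens
    return total / G                             # 1/G sum over responses
```

When the new, old, and reference policies coincide, the ratio is 1 and the KL term vanishes, so the objective reduces to the mean advantage, which is a quick sanity check on the implementation.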
- Results: Applied to Qwen-2.5 models, NemoTron-CrossThink achieves significant gains: +30.1% on MATH-500, +27.5% on AMC 23, +12.8% on MMLU-Pro, +15.1% on AGIEval, and reduces inference tokens by 28%.
- Ablation studies demonstrate that NemoTron-CrossThink improves token efficiency and that data diversity, rather than just volume, is key to broader reasoning capabilities. Difficulty filtering, which removes 'easy' questions, was also explored.
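One plausible form of such difficulty filtering is to drop questions the base model already solves consistently, so RL compute concentrates on informative samples. The all-attempts-correct criterion and `n_samples` value below are assumptions for illustration, not the paper's exact procedure:

```python
def filter_easy(questions, solve_fn, n_samples=4):
    """Hypothetical difficulty filter: drop questions the base model
    answers correctly on every one of n_samples sampled attempts."""
    kept = []
    for q in questions:
        correct = sum(solve_fn(q) == q["answer"] for _ in range(n_samples))
        if correct < n_samples:  # at least one failure => not 'easy'
            kept.append(q)
    return kept
```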
- Implementation Details: GRPO training was performed using the veRL framework with a constant learning rate of 1e-6, a batch size and PPO mini batch size of 128, and a maximum context length of 5000 tokens. The KL coefficient was set to 0.001.