
NemoTron-CrossThink: RL for Enhanced LLM Reasoning

  • NemoTron-CrossThink is a novel RL framework that improves LLM reasoning by blending synthetic and open-source Q&A pairs across STEM, humanities, and social sciences. It addresses the difficulty of defining verifiable rewards outside mathematics by applying structured answer templates (MCQ and open-ended) and filtering out questions whose answers cannot be checked automatically.
  • The framework curates a diverse dataset, _D = Dsyn ∪ Dos_, comprising synthetic data from Common Crawl and open-source QA datasets, applying templates (_Dmcq = TMCQ(Dgpr), Dopen = TOpen(Dgpr)_) and filtering steps (_H_) to ensure reward compatibility.
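The curation step above can be sketched as a small pipeline. This is an illustrative reconstruction, not the paper's code: the function names (`apply_mcq_template`, `is_verifiable`, etc.) and the verifiability heuristic are assumptions standing in for the templates _TMCQ_, _TOpen_ and the filter _H_.

```python
# Hypothetical sketch of dataset curation: blend synthetic (Common Crawl-derived)
# and open-source QA pairs, render each with an MCQ or open-ended template, and
# keep only items whose answers a rule-based reward can verify.

def apply_mcq_template(item):
    """T_MCQ: render a question with lettered options."""
    options = "\n".join(
        f"{chr(65 + i)}. {opt}" for i, opt in enumerate(item["options"])
    )
    return (f"Question: {item['question']}\n{options}\n"
            "Answer with the option letter.")

def apply_open_template(item):
    """T_Open: render a question without options."""
    return f"Question: {item['question']}\nGive a short final answer."

def is_verifiable(item):
    """Filter H (assumed heuristic): keep short, checkable answers."""
    ans = item.get("answer", "")
    return isinstance(ans, str) and 0 < len(ans.split()) <= 10

def curate(d_syn, d_os):
    """D = D_syn ∪ D_os, templated and filtered for reward compatibility."""
    curated = []
    for item in d_syn + d_os:
        if not is_verifiable(item):
            continue
        template = apply_mcq_template if "options" in item else apply_open_template
        curated.append({"prompt": template(item), "answer": item["answer"]})
    return curated
```

The key design point is that both templates normalize answers into a form a deterministic checker can score, which is what makes non-math domains usable for RL.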
  • Methodology: Employs Group Relative Policy Optimization (GRPO) with a rule-based reward function (_R = Racc ∧ Rformat_) that combines accuracy and formatting criteria. The GRPO objective function is defined as:
  • ```
    J_GRPO(θ) = E_{ q ∼ P(Q), {o_i}_{i=1}^G ∼ π_θ_old(O|q) } [
        (1/G) Σ_{i=1}^G (1/|o_i|) Σ_{t=1}^{|o_i|} (
            min( (π_θ(o_{i,t}|q, o_{i,<t}) / π_θ_old(o_{i,t}|q, o_{i,<t})) Â_{i,t},
                 clip( π_θ(o_{i,t}|q, o_{i,<t}) / π_θ_old(o_{i,t}|q, o_{i,<t}), 1 − ϵ, 1 + ϵ ) Â_{i,t} )
            − β D_KL[π_θ ∥ π_ref]
        )
    ]                                                                      (1)

    D_KL[π_θ ∥ π_ref] = π_ref(o_{i,t}|q, o_{i,<t}) / π_θ(o_{i,t}|q, o_{i,<t})
                        − log( π_ref(o_{i,t}|q, o_{i,<t}) / π_θ(o_{i,t}|q, o_{i,<t}) ) − 1.
    ```
  • Results: Applied to Qwen-2.5 models, NemoTron-CrossThink achieves significant gains: +30.1% on MATH-500, +27.5% on AMC 23, +12.8% on MMLU-Pro, +15.1% on AGIEval, and reduces inference token usage by 28%.
  • Ablation studies demonstrate that NemoTron-CrossThink improves token efficiency and that data diversity, rather than just volume, is key to broader reasoning capabilities. Difficulty filtering, which removes 'easy' questions, was also explored.
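The difficulty filtering mentioned above can be sketched as follows. This is an assumption-laden illustration: the solve-rate criterion, threshold, and function names are hypothetical stand-ins for however the paper identifies "easy" questions.

```python
# Hypothetical difficulty filter: sample several answers from a reference model
# and drop questions it already solves every time, keeping harder ones for RL.

def keep_hard(reference, sampled_answers, max_solve_rate=0.99):
    """Return True if the question is hard enough to keep for training."""
    if not sampled_answers:
        return True  # no evidence of easiness; keep it
    solves = sum(a.strip() == reference.strip() for a in sampled_answers)
    return solves / len(sampled_answers) < max_solve_rate
```

Filtering out questions the model never gets wrong concentrates the GRPO updates on prompts where the group of rollouts actually mixes successes and failures, which is where the relative advantage signal is informative.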
  • Implementation Details: GRPO training was performed using the veRL framework with a constant learning rate of 1e-6, a batch size and PPO mini-batch size of 128, and a maximum context length of 5000 tokens. The KL coefficient was set to 0.001.
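The hyperparameters above, gathered in one place for reference. The key names are illustrative; veRL's actual configuration schema differs.

```python
# Training hyperparameters reported in the summary (key names are hypothetical).
grpo_config = {
    "learning_rate": 1e-6,        # constant schedule
    "train_batch_size": 128,
    "ppo_mini_batch_size": 128,
    "max_context_length": 5000,   # tokens
    "kl_coef": 0.001,             # β in the GRPO objective (Eq. 1)
}
```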