ReTool: RL for Strategic Tool Use in LLMs
- ReTool is introduced as a novel reinforcement learning (RL) framework that integrates code interpreter execution to enhance strategic tool use in LLMs, addressing the limitations of text-based RL approaches.
- The methodology begins with a cold-start stage: a text-only reasoning dataset (D_init) is used to construct code-integrated reasoning data (D_CI), which teaches the model when and how to use computational tools before RL begins (an illustrative sample format appears after this list).
- The training algorithm is Proximal Policy Optimization (PPO) implemented in the VeRL framework, with an outcome reward R(a, â) that equals 1 when the predicted answer is equivalent to the ground truth and -1 otherwise, guiding the model to learn when and how to invoke tools (a minimal sketch of this reward follows the list).
- To interleave natural-language reasoning with executable code, the rollout process wraps generated code in `<code>`...`</code>` tags to mark code boundaries; when the code-termination trigger is detected, the enclosed code is executed in a sandbox and its output is fed back to the model within `<interpreter>` tags before generation resumes (see the rollout sketch below).
- Training details include masking the `<interpreter>` feedback output from the loss (those tokens are produced by the sandbox, not the model), KV-cache reuse to reduce memory cost during rollout by caching the KV-cache before code execution, and an asynchronous code-sandbox environment to accelerate RL training (a loss-masking sketch is included below).
- Cognitive analysis during RL training reveals an emergent ability for code self-correction: the model identifies and fixes errors in its own generated code, demonstrating improved reasoning and tool utilization. In one example, the model corrects a `NameError` by defining the `greedy` function in the correct scope (a hypothetical illustration of this pattern appears below).
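As an illustration of the cold-start data described above, a single code-integrated sample in D_CI might look like the following. This is only a hypothetical format sketch (the question, field names, and values are invented here), not the paper's actual construction pipeline.

```python
# Hypothetical illustration of one code-integrated reasoning sample in D_CI:
# a manual computation step from a text-only D_init trace is replaced by an
# executable code block plus the interpreter's output.
code_integrated_sample = {
    "question": "What is the sum of the first 100 positive integers?",
    "reasoning": (
        "Rather than adding term by term, compute the sum directly.\n"
        "<code>\nprint(sum(range(1, 101)))\n</code>\n"
        "<interpreter>5050</interpreter>\n"
        "So the sum of the first 100 positive integers is 5050."
    ),
    "answer": "5050",
}
```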
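The outcome reward is simple enough to sketch directly. The equivalence check below (`is_equivalent`) is a hypothetical placeholder; the paper relies on answer verification, but the exact matching logic here is an assumption.

```python
def is_equivalent(ground_truth: str, predicted: str) -> bool:
    # Hypothetical equivalence check: normalize and compare strings.
    # A real verifier might parse and compare mathematical expressions instead.
    return ground_truth.strip().lower() == predicted.strip().lower()


def outcome_reward(ground_truth: str, predicted: str) -> float:
    """R(a, a_hat): +1 if the predicted final answer is equivalent to the
    ground truth, -1 otherwise."""
    return 1.0 if is_equivalent(ground_truth, predicted) else -1.0
```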
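The interleaved rollout can be sketched as a simple loop. `generate_until` is a hypothetical decoding helper (continue generation until a stop string or end of sequence), and the subprocess-based sandbox below is only a rough stand-in for the paper's asynchronous sandbox service.

```python
import subprocess


def run_in_sandbox(code: str, timeout: float = 10.0) -> str:
    """Rough stand-in for a code sandbox: run the snippet in a separate Python
    process and return its stdout, or the error message on failure."""
    try:
        proc = subprocess.run(
            ["python", "-c", code], capture_output=True, text=True, timeout=timeout
        )
        return proc.stdout if proc.returncode == 0 else proc.stderr
    except subprocess.TimeoutExpired:
        return "Execution timed out."


def code_integrated_rollout(model, prompt: str, max_rounds: int = 8) -> str:
    """Interleave model generation with sandboxed code execution.

    `generate_until(model, text, stop)` is assumed to continue decoding until
    the stop string or end-of-sequence is produced, returning only the new text.
    """
    trajectory = prompt
    for _ in range(max_rounds):
        segment = generate_until(model, trajectory, stop="</code>")
        trajectory += segment
        if "</code>" not in segment:
            break  # no further code block: the model emitted its final answer
        # Extract the most recent code block and execute it in the sandbox.
        code = trajectory.rsplit("<code>", 1)[-1].rsplit("</code>", 1)[0]
        result = run_in_sandbox(code)
        # Feed the execution result back as interpreter feedback and continue.
        trajectory += f"<interpreter>{result}</interpreter>"
    return trajectory
```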
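One way the interpreter-feedback masking could be implemented (a sketch assuming a Hugging Face-style fast tokenizer with offset mappings, not the paper's code): tokens falling inside `<interpreter>...</interpreter>` spans receive a mask of 0, so sandbox output the model did not generate contributes nothing to the PPO loss.

```python
import re

import torch


def interpreter_loss_mask(text: str, tokenizer) -> torch.Tensor:
    """Return a per-token 0/1 mask: 0 for tokens inside <interpreter> spans
    (externally produced sandbox output), 1 for model-generated tokens."""
    enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
    feedback_spans = [
        m.span()
        for m in re.finditer(r"<interpreter>.*?</interpreter>", text, flags=re.DOTALL)
    ]
    mask = []
    for start, end in enc["offset_mapping"]:
        inside = any(s <= start and end <= e for s, e in feedback_spans)
        mask.append(0.0 if inside else 1.0)
    return torch.tensor(mask)
```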
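The `NameError` self-correction mentioned above follows a familiar Python scoping pattern; the snippet below is a hypothetical reconstruction of that kind of bug and fix, not the paper's actual transcript.

```python
# Buggy attempt (hypothetical): `greedy` is defined inside another function,
# so calling it at the top level raises NameError: name 'greedy' is not defined.
def solve():
    def greedy(coins, target):
        count = 0
        for c in sorted(coins, reverse=True):
            count += target // c
            target %= c
        return count

# greedy([25, 10, 5, 1], 63)  # NameError: name 'greedy' is not defined

# Corrected attempt: define `greedy` in the scope where it is actually called.
def greedy(coins, target):
    count = 0
    for c in sorted(coins, reverse=True):
        count += target // c
        target %= c
    return count

print(greedy([25, 10, 5, 1], 63))  # 6
```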