WebThinker: Autonomous Web Exploration and Report Writing for LRMs
- Methodology: WebThinker introduces a framework that equips LRMs with a Deep Web Explorer for autonomous web search and navigation, and an Autonomous Think-Search-and-Draft strategy that interleaves reasoning, information gathering, and report writing (a control-flow sketch follows this list). It employs online Direct Preference Optimization (DPO) for RL-based training to improve tool usage.
- Performance: WebThinker-32B-RL achieves state-of-the-art results among 32B models on complex reasoning benchmarks (GPQA: 70.7%, HLE: 15.8%), outperforming retrieval-augmented and proprietary systems, and excels in scientific report writing on the Glaive dataset with an average quality score of 8.1.
- RL Refinement: RL-trained versions consistently outperform base counterparts, demonstrating that iterative preference-based learning significantly enhances reasoning-tool coordination.
- Ablation Studies: Removing the Deep Web Explorer or automatic report drafting significantly degrades performance, validating the necessity of these components.
- Implementation Details: The system uses QwQ-32B or DeepSeek-R1-Distill-Qwen models as the backbone, Qwen2.5-Instruct models as assistants, the Bing Web Search API for search, and Crawl4AI for fetching web page content (a usage sketch appears below). Training involves two iterations of online DPO.
- _According to additional sources:_ The preference data construction for DPO considers overall correctness/quality, tool efficiency (fewer tool calls), and thinking conciseness (output length ratio); see the preference-scoring sketch after this list.
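
The interleaved think-search-and-draft loop from the methodology bullet can be illustrated with a minimal control-flow sketch. All interfaces below (the `Step` record, the `generate`/`explore`/`draft` callables, the stop condition) are assumptions for illustration only; the paper's actual prompts, special tokens, and tool protocol are not specified in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative interfaces only -- not WebThinker's actual API.

@dataclass
class Step:
    text: str                     # reasoning text produced in this step
    tool: Optional[str] = None    # "search", "draft", or None (final answer)
    query: str = ""               # search query when tool == "search"
    outline_item: str = ""        # section to draft when tool == "draft"

def run_episode(
    generate: Callable[[str], Step],    # LRM generation until a tool call or stop
    explore: Callable[[str], str],      # Deep Web Explorer: query -> evidence summary
    draft: Callable[[str, str], str],   # drafting tool: (context, outline item) -> section text
    question: str,
    max_steps: int = 20,
):
    """Interleave reasoning, web exploration, and report drafting."""
    context = f"Question: {question}\n"
    report: list[str] = []
    for _ in range(max_steps):
        step = generate(context)
        context += step.text
        if step.tool == "search":
            # Feed explorer findings back into the reasoning context.
            context += f"\n[search results]\n{explore(step.query)}\n"
        elif step.tool == "draft":
            # Draft or revise a report section mid-reasoning.
            report.append(draft(context, step.outline_item))
        else:
            return step.text, report  # no tool call: final answer reached
    return None, report
```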
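
For the page-content side mentioned in the implementation details, a minimal usage sketch of Crawl4AI's asynchronous crawler is shown below; the library version, crawler configuration, and any post-processing WebThinker applies to the extracted markdown are assumptions not covered by this summary.

```python
import asyncio
from crawl4ai import AsyncWebCrawler  # assumes a recent crawl4ai release

async def fetch_page_markdown(url: str) -> str:
    # Crawl the page and return its content as markdown for the LRM to read.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        return result.markdown

if __name__ == "__main__":
    print(asyncio.run(fetch_page_markdown("https://example.com")))
```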
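
The preference-data criteria above suggest a simple ranking over sampled trajectories. The sketch below is a self-contained illustration, assuming per-trajectory records of correctness, tool-call count, and output length, and a lexicographic ordering of the three criteria; the paper's exact scoring and length-ratio computation may differ.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    answer_correct: bool   # or report quality above a threshold
    tool_calls: int        # fewer calls = more efficient tool use
    output_tokens: int     # shorter reasoning = more concise thinking

def rank_key(t: Trajectory) -> tuple:
    # Lexicographic ranking: correctness first, then tool efficiency,
    # then thinking conciseness (assumed proxy for the length-ratio criterion).
    return (not t.answer_correct, t.tool_calls, t.output_tokens)

def build_preference_pair(trajectories: list[Trajectory]):
    """Pick the best and worst sampled trajectories as (chosen, rejected) for DPO."""
    ordered = sorted(trajectories, key=rank_key)
    return ordered[0], ordered[-1]

# Example: three sampled trajectories for the same question.
samples = [
    Trajectory(answer_correct=True,  tool_calls=4, output_tokens=3200),
    Trajectory(answer_correct=True,  tool_calls=2, output_tokens=2100),
    Trajectory(answer_correct=False, tool_calls=1, output_tokens=900),
]
chosen, rejected = build_preference_pair(samples)
print(chosen)    # correct, efficient, concise trajectory
print(rejected)  # incorrect trajectory
```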