
EVO-RAG: Reinforcement Learning for Multi-Hop Retrieval-Augmented Generation

This blog post summarizes EVO-RAG, a novel approach to improving the reliability and efficiency of multi-hop Retrieval-Augmented Generation (RAG) systems by addressing common issues such as redundant queries and shallow exploration.

Problem Definition

  • Multi-hop RAG pipelines often suffer from inefficiencies due to:
    • Redundant sub-queries that retrieve the same information multiple times.
    • Shallow exploration of the knowledge base, missing relevant documents.
    • Over-long search chains, increasing computational costs and potential for errors.
  • Existing Reinforcement Learning (RL)-based RAG approaches often use static objectives that do not adapt to different phases of the retrieval and generation process.

Proposed Solution

  • The paper introduces EVO-RAG, a two-stage framework that evolves a query-rewriting agent using curriculum-guided reinforcement learning.
    • Discovery Stage: Focuses on exploratory behaviors, prioritizing retrieval breadth and query diversity.
    • Refinement Stage: Emphasizes efficiency and accuracy, fine-tuning retrieval and generation processes.
  • Multi-Objective Reward Mechanism: A step-level reward vector with seven factors:
    • Relevance (r_ret): Rewards retrieval of relevant documents.
    • Redundancy (r_dup): Penalizes redundant sub-queries using cosine similarity of query embeddings generated by sentence-transformers/all-MiniLM-L6-v2.
    • Efficiency (r_step, r_act): Discourages long reasoning chains and penalizes additional search actions.
    • Answer Correctness (r_ans): Rewards accurate answers at the termination step using Exact Match (EM) and F1 scores.
    • Backtracking (r_bt): Penalizes unnecessary backtracking.
    • Refusal (r_ref): Rewards truthful refusal when evidence is insufficient.
  • Dynamic, Time-Based Reward Scheduler: Adjusts reward component weights throughout each retrieval-generation episode, shifting from broad initial discovery to precise late-stage refinement (see the sketch after this list).
  • Multi-Head Preference Model: Ranks sibling actions by cumulative reward using a preference function trained with Direct Preference Optimization (DPO), with β = 0.1 and a reward-difference threshold Δ = 0.3 (a pair-construction sketch appears further below).
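
To make the reward design concrete, here is a minimal Python sketch of a step-level reward with a redundancy penalty and time-based weight scheduling. The embedding model (sentence-transformers/all-MiniLM-L6-v2) and the seven reward factors come from the paper; the specific weight values, the linear scheduling rule, and the helper names (redundancy_penalty, scheduled_weights, step_reward) are illustrative assumptions, not the authors' implementation.

```python
from sentence_transformers import SentenceTransformer, util

# Embedding model named in the paper for measuring sub-query redundancy.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def redundancy_penalty(new_query: str, past_queries: list[str]) -> float:
    """r_dup: negative of the max cosine similarity to any earlier sub-query."""
    if not past_queries:
        return 0.0
    emb_new = encoder.encode(new_query, convert_to_tensor=True)
    emb_old = encoder.encode(past_queries, convert_to_tensor=True)
    return -float(util.cos_sim(emb_new, emb_old).max())

def scheduled_weights(step: int, max_steps: int) -> dict[str, float]:
    """Time-based scheduler: emphasize retrieval breadth early, efficiency and
    answer correctness late. The linear ramps below are purely illustrative."""
    t = step / max(max_steps, 1)       # normalized episode time in [0, 1]
    return {
        "r_ret": 1.0 - 0.5 * t,        # discovery: reward broad retrieval early
        "r_dup": 0.5 + 0.5 * t,        # refinement: punish redundancy harder late
        "r_step": 0.2 + 0.8 * t,       # discourage long reasoning chains late
        "r_act": 0.2 + 0.8 * t,        # penalize extra search actions late
        "r_ans": t,                    # answer correctness matters at termination
        "r_bt": 0.5,                   # backtracking penalty (kept constant here)
        "r_ref": 0.5,                  # truthful-refusal reward (kept constant here)
    }

def step_reward(components: dict[str, float], step: int, max_steps: int) -> float:
    """Scalarize the seven-factor reward vector with the scheduled weights."""
    weights = scheduled_weights(step, max_steps)
    return sum(weights[k] * components.get(k, 0.0) for k in weights)
```

The point of the sketch is that the same reward components are computed at every step; only their weights shift as the episode moves from discovery to refinement.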

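In the same spirit, here is a rough sketch of how sibling actions could be turned into DPO preference pairs under the reported hyperparameters (β = 0.1, reward-gap threshold Δ = 0.3). The pair-construction logic and the standalone DPO loss below are assumptions for illustration; the paper's multi-head preference model is not reproduced here.

```python
import torch.nn.functional as F

BETA = 0.1    # DPO temperature reported in the paper
DELTA = 0.3   # minimum cumulative-reward gap for a usable preference pair

def build_preference_pairs(siblings):
    """siblings: list of (action_text, cumulative_reward) tuples for one
    expansion step. Keep (chosen, rejected) pairs whose reward gap exceeds DELTA."""
    ranked = sorted(siblings, key=lambda s: s[1], reverse=True)
    pairs = []
    for i, (chosen, r_hi) in enumerate(ranked):
        for rejected, r_lo in ranked[i + 1:]:
            if r_hi - r_lo > DELTA:
                pairs.append((chosen, rejected))
    return pairs

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected):
    """Standard DPO objective on policy/reference log-probabilities of the
    chosen and rejected actions (all arguments are torch tensors)."""
    logits = BETA * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(logits).mean()
```

For example, two sibling rewrites with cumulative rewards 0.9 and 0.4 would form one (chosen, rejected) pair, since their gap of 0.5 exceeds Δ = 0.3.
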
Results

  • Benchmarks: Evaluated on HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle datasets.
  • Baselines: Compared against RAG-Gym and IRCoT.
  • Key Findings:
    • EVO-RAG improves Exact Match scores by up to 4.6 points compared to strong RAG baselines.
    • Reduces average retrieval depth by 15%.
    • Outperforms the baselines across all four benchmark datasets.
    • Dynamic reward scheduling generally leads to better or comparable accuracy relative to fixed two-stage scheduling.

Importance

  • EVO-RAG advances multi-stage, multi-objective optimization in RAG systems.
  • The framework is versatile and scalable, potentially applicable to broader NLP tasks.
  • It dynamically optimizes the sequence of sub-queries during multi-hop retrieval, addressing a key gap in existing RAG approaches.

Limitations

  • Evaluation relies exclusively on automatic metrics (Exact Match and F1).
  • Reward parameters and scheduling were manually tuned primarily on the HotpotQA dataset.
  • The method employs explicit action prompts rather than fully learned latent actions.
  • Computational experiments were performed on a single GPU with moderately sized (8B-parameter) models such as Meta-Llama-3.1-8B-Instruct, Qwen3-8B, and Deepseek-R1-distill-llama3-8B.