EVO-RAG: RL for Multi-Hop Retrieval Generation
This blog post summarizes a novel approach to enhance the reliability and efficiency of multi-hop Retrieval-Augmented Generation (RAG) systems, addressing common issues such as redundant queries and shallow exploration.
Problem Definition
- Multi-hop RAG pipelines often suffer from inefficiencies due to:
  - Redundant sub-queries that retrieve the same information multiple times.
  - Shallow exploration of the knowledge base, missing relevant documents.
  - Over-long search chains, increasing computational costs and potential for errors.
- Existing Reinforcement Learning (RL)-based RAG approaches often use static objectives that do not adapt to different phases of the retrieval and generation process.
Proposed Solution
- The paper introduces EVO-RAG, a two-stage framework that evolves a query-rewriting agent using curriculum-guided reinforcement learning.
  - Discovery Stage: Focuses on exploratory behaviors, prioritizing retrieval breadth and query diversity.
  - Refinement Stage: Emphasizes efficiency and accuracy, fine-tuning the retrieval and generation process.
- Multi-Objective Reward Mechanism: A step-level reward vector with seven factors:
  - Relevance (r_ret): Rewards retrieval of relevant documents.
  - Redundancy (r_dup): Penalizes redundant sub-queries using cosine similarity of query embeddings generated by sentence-transformers/all-MiniLM-L6-v2 (sketched after this list).
  - Efficiency (r_step, r_act): Discourages long reasoning chains and penalizes additional search actions.
  - Answer Correctness (r_ans): Rewards accurate answers at the termination step, measured by Exact Match (EM) and F1 scores.
  - Backtracking (r_bt): Penalizes unnecessary backtracking.
  - Refusal (r_ref): Rewards truthful refusal when evidence is insufficient.
- Dynamic, Time-Based Reward Scheduler: Adjusts reward component weights throughout each retrieval-generation episode, shifting from broad initial discovery to precise late-stage refinement (sketched after this list).
- Multi-Head Preference Model: Ranks sibling actions by cumulative reward using a preference function trained with Direct Preference Optimization (DPO), with β = 0.1 and a reward-difference threshold Δ = 0.3 (sketched after this list).
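The redundancy signal is the most mechanical of the seven, so a minimal sketch may help. The Python snippet below shows one way r_dup could be computed from cosine similarity of all-MiniLM-L6-v2 embeddings; the similarity threshold and penalty magnitude are illustrative assumptions, not values from the paper.

```python
# Sketch: redundancy penalty r_dup from cosine similarity of sub-query embeddings.
# The similarity threshold (0.9) and penalty value (-1.0) are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def redundancy_reward(new_query: str, previous_queries: list[str],
                      sim_threshold: float = 0.9, penalty: float = -1.0) -> float:
    """Penalize a sub-query that is a near-duplicate of an earlier one."""
    if not previous_queries:
        return 0.0
    embeddings = encoder.encode([new_query] + previous_queries, convert_to_tensor=True)
    sims = cos_sim(embeddings[0:1], embeddings[1:])  # shape: 1 x len(previous_queries)
    return penalty if float(sims.max()) >= sim_threshold else 0.0

# Example: the second query is a near-paraphrase of the first and gets penalized.
history = ["Who directed the film Inception?"]
print(redundancy_reward("Which person directed Inception?", history))
```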
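Similarly, here is a minimal sketch of what a time-based reward scheduler could look like: per-component weights are linearly interpolated from discovery-oriented values to refinement-oriented values over the episode. The endpoint weights and the linear schedule are assumptions for illustration; the paper's actual weights and schedule may differ.

```python
# Sketch: time-based reward scheduler that shifts weight from retrieval breadth
# toward answer accuracy and efficiency as the episode progresses.
# The endpoint weights and the linear interpolation are illustrative assumptions.

DISCOVERY_WEIGHTS = {"r_ret": 1.0, "r_dup": 0.2, "r_step": 0.1, "r_act": 0.1,
                     "r_ans": 0.5, "r_bt": 0.2, "r_ref": 0.2}
REFINEMENT_WEIGHTS = {"r_ret": 0.4, "r_dup": 0.8, "r_step": 0.6, "r_act": 0.6,
                      "r_ans": 1.0, "r_bt": 0.6, "r_ref": 0.5}

def scheduled_weights(step: int, max_steps: int) -> dict[str, float]:
    """Linearly interpolate per-component weights over the episode."""
    t = min(step / max(max_steps - 1, 1), 1.0)  # 0.0 at the first step, 1.0 at the last
    return {k: (1 - t) * DISCOVERY_WEIGHTS[k] + t * REFINEMENT_WEIGHTS[k]
            for k in DISCOVERY_WEIGHTS}

def step_reward(components: dict[str, float], step: int, max_steps: int) -> float:
    """Weighted sum of the seven reward signals at a given step."""
    weights = scheduled_weights(step, max_steps)
    return sum(weights[k] * components.get(k, 0.0) for k in weights)

# Early steps emphasize relevance; late steps emphasize correctness and brevity.
print(scheduled_weights(0, 6)["r_ret"], scheduled_weights(5, 6)["r_ans"])
```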
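Finally, a sketch of the preference-ranking step: sibling actions are paired only when their cumulative rewards differ by more than Δ = 0.3, and the pairs are scored with the standard DPO objective at β = 0.1. How per-action log-probabilities are obtained from the policy and reference models is an assumption here, not something specified in this summary.

```python
# Sketch: DPO-style preference loss over sibling actions. A training pair is formed only
# when the cumulative-reward gap exceeds DELTA = 0.3; BETA = 0.1 as reported for the paper.
# Obtaining per-action log-probs from the policy/reference models is assumed, not shown.
import torch
import torch.nn.functional as F

BETA = 0.1
DELTA = 0.3

def dpo_loss(policy_logp_chosen: torch.Tensor, policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor) -> torch.Tensor:
    """Standard DPO objective: push the policy toward the higher-reward sibling action."""
    logits = BETA * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(logits).mean()

def build_pairs(siblings: list[dict]) -> list[tuple[dict, dict]]:
    """Pair sibling actions whose cumulative rewards differ by more than DELTA."""
    pairs = []
    for i, a in enumerate(siblings):
        for b in siblings[i + 1:]:
            if abs(a["cum_reward"] - b["cum_reward"]) > DELTA:
                chosen, rejected = (a, b) if a["cum_reward"] > b["cum_reward"] else (b, a)
                pairs.append((chosen, rejected))
    return pairs

# Example: only the (1.2, 0.4) gap exceeds DELTA, so one preference pair is produced.
print(build_pairs([{"id": 0, "cum_reward": 1.2}, {"id": 1, "cum_reward": 1.0},
                   {"id": 2, "cum_reward": 0.4}]))
```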
Results
- Benchmarks: Evaluated on HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle datasets.
- Baselines: Compared against RAG-Gym and IRCoT.
- Key Findings:
  - EVO-RAG improves Exact Match scores by up to 4.6 points compared to strong RAG baselines.
  - Reduces average retrieval depth by 15%.
  - Achieves superior performance across all four benchmark datasets.
  - Dynamic reward scheduling generally yields comparable or better accuracy than a fixed two-stage schedule.
Importance
- EVO-RAG advances multi-stage, multi-objective optimization in RAG systems.
- The framework is versatile and scalable, potentially applicable to broader NLP tasks.
- It dynamically optimizes the sequence of sub-queries during multi-hop retrieval, addressing a key gap in existing RAG approaches.
Limitations
- Evaluation relies exclusively on automatic metrics (Exact Match and F1).
- Reward parameters and scheduling were manually tuned primarily on the HotpotQA dataset.
- The method employs explicit action prompts rather than fully learned latent actions.
- Experiments were performed on a single GPU using moderately sized (8B-parameter) models such as Meta-Llama-3.1-8B-Instruct, Qwen3-8B, and Deepseek-R1-distill-llama3-8B.
Source: