
EVO-RAG: Reinforcement Learning for Multi-Hop Retrieval-Augmented Generation

This blog post summarizes EVO-RAG, a novel approach to improving the reliability and efficiency of multi-hop Retrieval-Augmented Generation (RAG) systems by addressing common issues such as redundant queries and shallow exploration.

Problem Definition

  • Multi-hop RAG pipelines often suffer from inefficiencies due to:
    • Redundant sub-queries that retrieve the same information multiple times.
    • Shallow exploration of the knowledge base, missing relevant documents.
    • Over-long search chains, increasing computational costs and potential for errors.
  • Existing Reinforcement Learning (RL)-based RAG approaches often use static objectives that do not adapt to different phases of the retrieval and generation process.

Proposed Solution

  • The paper introduces EVO-RAG, a two-stage framework that evolves a query-rewriting agent using curriculum-guided reinforcement learning.
    • Discovery Stage: Focuses on exploratory behaviors, prioritizing retrieval breadth and query diversity.
    • Refinement Stage: Emphasizes efficiency and accuracy, fine-tuning retrieval and generation processes.
  • Multi-Objective Reward Mechanism: A step-level reward vector with seven factors:
    • Relevance (r_ret): Rewards retrieval of relevant documents.
    • Redundancy (r_dup): Penalizes redundant sub-queries using cosine similarity of query embeddings generated by sentence-transformers/all-MiniLM-L6-v2.
    • Efficiency (r_step, r_act): Discourages long reasoning chains and penalizes additional search actions.
    • Answer Correctness (r_ans): Rewards accurate answers at the termination step using Exact Match (EM) and F1 scores.
    • Backtracking (r_bt): Penalizes unnecessary backtracking.
    • Refusal (r_ref): Rewards truthful refusal when evidence is insufficient.
  • Dynamic, Time-Based Reward Scheduler: Adjusts reward component weights throughout each retrieval-generation episode, shifting from broad initial discovery to precise late-stage refinement (see the sketch after this list).
  • Multi-Head Preference Model: Ranks sibling actions by cumulative reward using a preference function trained with Direct Preference Optimization (DPO), with β = 0.1 and a reward-difference threshold Δ = 0.3 (a pair-construction sketch appears further below).
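
To make the reward design concrete, here is a minimal Python sketch of a step-level reward with a redundancy penalty and time-based weight scheduling. The embedding model (sentence-transformers/all-MiniLM-L6-v2) and the seven reward factors come from the paper; the specific weight values, the linear scheduling rule, and the helper names (redundancy_penalty, scheduled_weights, step_reward) are illustrative assumptions, not the authors' implementation.

```python
from sentence_transformers import SentenceTransformer, util

# Embedding model named in the paper for measuring sub-query redundancy.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def redundancy_penalty(new_query: str, past_queries: list[str]) -> float:
    """r_dup: negative of the max cosine similarity to any earlier sub-query."""
    if not past_queries:
        return 0.0
    emb_new = encoder.encode(new_query, convert_to_tensor=True)
    emb_old = encoder.encode(past_queries, convert_to_tensor=True)
    return -float(util.cos_sim(emb_new, emb_old).max())

def scheduled_weights(step: int, max_steps: int) -> dict[str, float]:
    """Time-based scheduler: emphasize retrieval breadth early, efficiency and
    answer correctness late. The linear ramps below are purely illustrative."""
    t = step / max(max_steps, 1)       # normalized episode time in [0, 1]
    return {
        "r_ret": 1.0 - 0.5 * t,        # discovery: reward broad retrieval early
        "r_dup": 0.5 + 0.5 * t,        # refinement: punish redundancy harder late
        "r_step": 0.2 + 0.8 * t,       # discourage long reasoning chains late
        "r_act": 0.2 + 0.8 * t,        # penalize extra search actions late
        "r_ans": t,                    # answer correctness matters at termination
        "r_bt": 0.5,                   # backtracking penalty (kept constant here)
        "r_ref": 0.5,                  # truthful-refusal reward (kept constant here)
    }

def step_reward(components: dict[str, float], step: int, max_steps: int) -> float:
    """Scalarize the seven-factor reward vector with the scheduled weights."""
    weights = scheduled_weights(step, max_steps)
    return sum(weights[k] * components.get(k, 0.0) for k in weights)
```

The point of the sketch is that the same reward components are computed at every step; only their weights shift as the episode moves from discovery to refinement.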

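In the same spirit, here is a rough sketch of how sibling actions could be turned into DPO preference pairs under the reported hyperparameters (β = 0.1, reward-gap threshold Δ = 0.3). The pair-construction logic and the standalone DPO loss below are assumptions for illustration; the paper's multi-head preference model is not reproduced here.

```python
import torch.nn.functional as F

BETA = 0.1    # DPO temperature reported in the paper
DELTA = 0.3   # minimum cumulative-reward gap for a usable preference pair

def build_preference_pairs(siblings):
    """siblings: list of (action_text, cumulative_reward) tuples for one
    expansion step. Keep (chosen, rejected) pairs whose reward gap exceeds DELTA."""
    ranked = sorted(siblings, key=lambda s: s[1], reverse=True)
    pairs = []
    for i, (chosen, r_hi) in enumerate(ranked):
        for rejected, r_lo in ranked[i + 1:]:
            if r_hi - r_lo > DELTA:
                pairs.append((chosen, rejected))
    return pairs

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected):
    """Standard DPO objective on policy/reference log-probabilities of the
    chosen and rejected actions (all arguments are torch tensors)."""
    logits = BETA * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(logits).mean()
```

For example, two sibling rewrites with cumulative rewards 0.9 and 0.4 would form one (chosen, rejected) pair, since their gap of 0.5 exceeds Δ = 0.3.
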
Results

  • Benchmarks: Evaluated on HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle datasets.
  • Baselines: Compared against RAG-Gym and IRCoT.
  • Key Findings:
    • EVO-RAG improves Exact Match scores by up to 4.6 points compared to strong RAG baselines.
    • Reduces average retrieval depth by 15%.
    • Outperforms the baselines across all four benchmark datasets.
    • Dynamic reward scheduling generally leads to better or comparable accuracy relative to fixed two-stage scheduling.

Importance

  • EVO-RAG advances multi-stage, multi-objective optimization in RAG systems.
  • The framework is versatile and scalable, potentially applicable to broader NLP tasks.
  • It dynamically optimizes the sequence of sub-queries during multi-hop retrieval, addressing a key gap in existing RAG approaches.

Limitations

  • Evaluation relies exclusively on automatic metrics (Exact Match and F1).
  • Reward parameters and scheduling were manually tuned primarily on the HotpotQA dataset.
  • The method employs explicit action prompts rather than fully learned latent actions.
  • Computational experiments were performed on a single GPU with moderately sized (8B-parameter) models such as Meta-Llama-3.1-8B-Instruct, Qwen3-8B, and Deepseek-R1-distill-llama3-8B.