HM-RAG: Hierarchical Multi-Agent Multimodal RAG
- HM-RAG introduces a novel three-tier architecture for RAG: a Decomposition Agent for query rewriting, Multi-source Retrieval Agents for parallel modality-specific retrieval (vector, graph, web), and a Decision Agent for consistency voting and expert-model refinement (see the pipeline sketch after this list). This addresses the limitations of single-agent RAG in complex, heterogeneous data environments.
- The Multimodal Knowledge Pre-Processing stage uses BLIP-2 to convert visual information into textual representations (`T_v`), refines over-condensed captions via contextual mechanisms, and merges them with the original text corpus (`T`) to build the vector and graph databases (`T_m = Concat(T, T_v)`); a captioning sketch follows the list.
- The Graph-based Retrieval Agent leverages LightRAG to construct context-aware subgraphs (`G_q ⊆ G`), dynamically retrieving entities and relations through cross-modal attention and using a hierarchical search strategy that balances efficiency and comprehensiveness (see the LightRAG snippet after this list).
- HM-RAG achieves state-of-the-art results on ScienceQA (93.73% average accuracy, a 12.95% improvement over vector-based baselines) and CrisisMMD (58.55% average accuracy, a 2.44% improvement over GPT-4o), demonstrating effectiveness in multimodal question answering and classification.
- Ablation studies on ScienceQA reveal the Decision Agent's critical role: removing it causes a 10.82% performance decline, hitting image-based and social-reasoning tasks hardest. The web-based retrieval agent also integrates robustly alongside the vector and graph retrievers.
- The framework's design facilitates modularity and scalability, enabling seamless integration of new data modalities while maintaining data governance, marking a significant advancement in multimodal reasoning and knowledge synthesis in RAG systems.
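To make the three-tier flow concrete, here is a minimal sketch of the agent pipeline. All names (`Candidate`, `decompose`, `retrieve_all`, `decide`) and the majority-vote-with-fallback rule are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Candidate:
    answer: str
    source: str   # "vector", "graph", or "web"
    score: float

def decompose(query: str) -> list[str]:
    """Tier 1 (Decomposition Agent): rewrite a complex query into sub-queries.
    A real system would prompt an LLM; here the query passes through unchanged."""
    return [query]

def retrieve_all(sub_queries: list[str], retrievers: dict) -> list[Candidate]:
    """Tier 2 (Multi-source Retrieval Agents): query each modality-specific
    retriever for every sub-query (run in parallel in a real system)."""
    candidates = []
    for sq in sub_queries:
        for source, retriever in retrievers.items():
            candidates.append(Candidate(answer=retriever(sq), source=source, score=1.0))
    return candidates

def decide(candidates: list[Candidate]) -> str:
    """Tier 3 (Decision Agent): consistency voting across retrieval agents.
    When no majority emerges, the paper refines with an expert model; this
    sketch falls back to the highest-scoring candidate instead."""
    votes = Counter(c.answer for c in candidates)
    answer, count = votes.most_common(1)[0]
    if count > len(candidates) // 2:
        return answer
    return max(candidates, key=lambda c: c.score).answer

# Toy usage with stub retrievers standing in for the vector/graph/web agents.
retrievers = {
    "vector": lambda q: "A",
    "graph":  lambda q: "A",
    "web":    lambda q: "B",
}
print(decide(retrieve_all(decompose("Which gas do plants absorb?"), retrievers)))  # -> "A"
```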
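The pre-processing stage can be approximated with the Hugging Face `transformers` BLIP-2 interface. This is a minimal sketch assuming the `Salesforce/blip2-opt-2.7b` checkpoint and a plain list concatenation for `T_m = Concat(T, T_v)`; the paper's contextual caption-refinement step is omitted.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

def caption_image(path: str) -> str:
    """Convert one image into a textual representation (one element of T_v)."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device, dtype)
    out = model.generate(**inputs, max_new_tokens=50)
    return processor.decode(out[0], skip_special_tokens=True).strip()

def build_corpus(text_docs: list[str], image_paths: list[str]) -> list[str]:
    """T_m = Concat(T, T_v): merge the text corpus with image captions
    before building the vector and graph databases."""
    t_v = [caption_image(p) for p in image_paths]
    return text_docs + t_v
```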
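The hierarchical search strategy maps naturally onto LightRAG's query modes (`local` for entity-level subgraphs, `global` for community-level context, `hybrid` for both). The snippet below follows the insert/query interface from LightRAG's README; the API has changed across library versions, so treat the exact module paths and signatures as assumptions.

```python
from lightrag import LightRAG, QueryParam
from lightrag.llm import gpt_4o_mini_complete  # module path may vary by version

# Sketch of the graph-based retrieval agent on top of LightRAG.
rag = LightRAG(
    working_dir="./hm_rag_graph",          # hypothetical working directory
    llm_model_func=gpt_4o_mini_complete,   # LLM used for entity/relation extraction
)

# Indexing T_m: LightRAG extracts entities and relations into a
# knowledge graph G during insertion.
rag.insert(open("tm_corpus.txt").read())   # hypothetical corpus file

# "hybrid" mode combines entity-level retrieval (akin to a query-specific
# subgraph G_q ⊆ G) with global, community-level retrieval -- the
# efficiency/comprehensiveness balance the hierarchical strategy targets.
answer = rag.query(
    "Which process converts CO2 into glucose?",
    param=QueryParam(mode="hybrid"),
)
print(answer)
```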