Holmes: LLM-Assisted Automated Fact-Checking Framework
- *Holmes is an end-to-end framework for automated fact-checking of multimodal disinformation, leveraging LLMs for evidence retrieval and claim verification.* The framework addresses the limitations of LLMs in autonomously finding accurate and relevant evidence by using a novel evidence retrieval methodology.
- *The core innovation is a two-step evidence retrieval process.* First, LLMs summarize open-source information based on predefined rules (People, Event, Location, Time, Reason, Background, Impact, Follow-up). Second, a new algorithm evaluates the quality of extracted evidence based on credibility, relevance (cosine similarity between claim and evidence embeddings using BLIP-2), and integrity (coverage of event arguments extracted using ChatIE). The evidence score `EQ(e_i) = α · Rel(e_i, C) + (1 − α) · Int(e_i)` balances relevance and integrity, with α set to 0.5 (see the scoring sketch after this list).
- *Empirical studies reveal that LLMs alone struggle to verify disinformation without external evidence.* LLMs often fail to access specific information, leading to inaccurate judgments, especially for claims emerging after their knowledge cutoff date. Zero-shot prompting and Chain-of-Thought (CoT) achieve low success rates (e.g., GPT-4o at 5.3% with CoT).
- *Providing LLMs with human-written evidence significantly improves fact-checking accuracy.* GPT-4o and Gemini-1.5-flash achieve success rates of 90.3% and 92.9%, respectively, when supplied with relevant evidence. However, without evidence retrieval, LLMs may hallucinate links, undermining trust.
- *Holmes achieves state-of-the-art accuracy in multimodal disinformation detection.* It attains 88.3% accuracy on open-source datasets and 90.2% in real-time verification, outperforming deep learning baselines. The improved evidence retrieval boosts fact-checking accuracy by 30.8% compared to existing methods.
- *Ablation studies confirm the effectiveness of Holmes' evidence retrieval method.* Compared to existing methods, Holmes extracts more abundant and comprehensive information from original web pages, enabling more accurate disinformation verification by LLMs. Holmes also outperforms LLMs with built-in search capabilities (GPT-4o-search-preview, GPT-4o-mini-search-preview, and Gemini-1.5-flash-search-grounding) on the verification task.
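
As a rough illustration of the scoring step described above, the Python sketch below computes the evidence-quality score `EQ(e_i) = α · Rel(e_i, C) + (1 − α) · Int(e_i)` with α = 0.5 and keeps the highest-scoring credible evidence. The `embed`, `extract_event_arguments`, and credibility-flag pieces are simplified stand-ins for the paper's BLIP-2 embeddings, ChatIE argument extraction, and source-credibility check, and the coverage-based integrity measure is an interpretation of the summary rather than a confirmed implementation detail.

```python
# Sketch of Holmes-style evidence scoring, with toy stand-ins so it runs without
# external models: embed() replaces BLIP-2 embeddings, extract_event_arguments()
# replaces ChatIE, and Evidence.source_credible replaces the credibility check.
from __future__ import annotations

import math
from collections import Counter
from dataclasses import dataclass

ALPHA = 0.5  # weight between relevance and integrity, as reported in the paper


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for BLIP-2 embeddings)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extract_event_arguments(text: str) -> set[str]:
    """Stub for ChatIE-style extraction of event arguments (People, Event,
    Location, Time, Reason, Background, Impact, Follow-up).
    Here: just the capitalized tokens."""
    return {tok.strip(".,") for tok in text.split() if tok[:1].isupper()}


@dataclass
class Evidence:
    text: str
    source_credible: bool  # credibility flag from the source check


def evidence_quality(evidence: Evidence, claim: str) -> float:
    """EQ(e_i) = ALPHA * Rel(e_i, C) + (1 - ALPHA) * Int(e_i)."""
    rel = cosine_similarity(embed(evidence.text), embed(claim))
    claim_args = extract_event_arguments(claim)
    ev_args = extract_event_arguments(evidence.text)
    # Integrity interpreted as the fraction of the claim's event arguments
    # that the evidence covers (an assumption, not stated in the summary).
    integrity = len(claim_args & ev_args) / len(claim_args) if claim_args else 0.0
    return ALPHA * rel + (1 - ALPHA) * integrity


def select_evidence(candidates: list[Evidence], claim: str, k: int = 3) -> list[Evidence]:
    """Drop non-credible sources, then keep the top-k evidence pieces by EQ score."""
    credible = [e for e in candidates if e.source_credible]
    return sorted(credible, key=lambda e: evidence_quality(e, claim), reverse=True)[:k]


if __name__ == "__main__":
    claim = "Mayor Smith closed Central Park on Monday after the storm"
    pool = [
        Evidence("Central Park was closed by Mayor Smith on Monday following storm damage", True),
        Evidence("A cooking festival was held downtown last weekend", True),
        Evidence("Unverified post claims the park never closed", False),
    ]
    for ev in select_evidence(pool, claim):
        print(f"EQ={evidence_quality(ev, claim):.2f}  {ev.text}")
    # The selected evidence would then be placed into an LLM prompt for the final verdict.
```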