RAG for Legal Information Retrieval with NMF and KG

This blog post summarizes a scientific paper that introduces a generative AI system designed to enhance legal information retrieval and AI reasoning. The system integrates Retrieval-Augmented Generation (RAG), Vector Stores (VS), and Knowledge Graphs (KG) constructed via Non-Negative Matrix Factorization (NMF).

Problem Definition

  • Traditional legal information retrieval methods often miss subtle conceptual overlaps and deep contextual cues in legal inquiries.
  • The legal domain is complex, encompassing constitutions, statutes, court rules, regulations, ordinances, and case law.
  • Traditional methods rely on Boolean logic and lexical indexing (TF-IDF), which may not capture conceptual relationships.
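
To make the last point concrete, the following is a minimal sketch of lexical TF-IDF matching in plain Python. The two-sentence corpus, the queries, and the helper names (`tfidf_vectors`, `lexical_scores`) are illustrative assumptions, not material from the paper. It shows how a query that paraphrases a document without sharing any of its words gets a similarity of zero.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (deliberately naive)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using tf * smoothed idf."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def lexical_scores(query, corpus):
    """Score every corpus document against the query by TF-IDF cosine."""
    vectors = tfidf_vectors(corpus + [query])
    query_vec = vectors[-1]
    return [cosine(query_vec, v) for v in vectors[:-1]]


# Hypothetical toy corpus; the first sentence is conceptually about eviction.
corpus = [
    "tenant must vacate premises after eviction notice",
    "landlord notice requirements for rent increase",
]

paraphrase_scores = lexical_scores("lessee removal from rental property", corpus)
keyword_scores = lexical_scores("eviction notice", corpus)
```

Here `paraphrase_scores` is zero for both documents because the query shares no tokens with them, while `keyword_scores` ranks the first document highest. Dense embeddings are meant to close this gap.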

Proposed Solution

  • The proposed solution is a generative AI system called Smart-SLIC that combines RAG, Vector Stores (VS), and Knowledge Graphs (KG).
  • The system uses VS to capture semantic meaning beyond keyword matching. Models such as BERT and GPT embed legal texts into dense vector representations.
  • It employs KG to formalize relationships between legal concepts (statutes, cases, doctrines), enabling structured navigation and explicit linking of legal authorities, using Neo4j.
  • NMF is applied to uncover latent topics and patterns in unstructured text, factorizing word-embedding matrices into interpretable topics. Tensor Extraction of Latent Features (T-ELF) is combined with automatic model selection (NMFk).
  • The paper also notes limitations: author attribution in the networks is incomplete, additional datasets are needed, and informal post-decree agreements still have to be systematically reconciled with formal judgments.
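
The way vector retrieval and graph links can complement each other is sketched below in plain Python. This is an illustration under stated assumptions, not the Smart-SLIC implementation: the documents, the hand-made 3-dimensional vectors (stand-ins for real embeddings), the edge list, and every function name are hypothetical. In-memory dictionaries replace the vector store and the graph database, and no language model is called.

```python
import math

# Hypothetical toy corpus: id -> (text, hand-made vector standing in for an embedding).
DOCS = {
    "statute_a": ("Statute A: written notice is required before eviction.", [0.9, 0.1, 0.0]),
    "case_x": ("Case X: applies Statute A to a month-to-month lease.", [0.8, 0.2, 0.1]),
    "statute_b": ("Statute B: amends the notice period in Statute A.", [0.1, 0.9, 0.1]),
    "statute_c": ("Statute C: licensing rules for commercial fishing.", [0.0, 0.1, 0.9]),
}

# Hypothetical knowledge-graph edges: (source, relation, target).
EDGES = [
    ("case_x", "CITES", "statute_a"),
    ("statute_b", "AMENDS", "statute_a"),
]


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def vector_search(query_vec, k):
    """Return the ids of the k documents closest to the query vector."""
    ranked = sorted(DOCS, key=lambda d: cosine(query_vec, DOCS[d][1]), reverse=True)
    return ranked[:k]


def expand_with_graph(doc_ids):
    """Add documents linked to any hit by a graph edge, in either direction."""
    expanded = list(doc_ids)
    for src, _relation, dst in EDGES:
        for a, b in ((src, dst), (dst, src)):
            if a in doc_ids and b not in expanded:
                expanded.append(b)
    return expanded


def build_prompt(question, doc_ids):
    """Assemble the retrieved passages into a context block for a generator."""
    context = "\n".join(f"[{d}] {DOCS[d][0]}" for d in doc_ids)
    return f"Answer using only the context below.\n{context}\nQuestion: {question}"


question = "What notice is required before eviction?"
hits = vector_search([1.0, 0.0, 0.0], k=1)  # query vector is also hand-made
context_ids = expand_with_graph(hits)
prompt = build_prompt(question, context_ids)
```

In this toy setup the vector search returns only `statute_a`, and the graph step pulls in `case_x` and `statute_b`, which are linked to it by explicit edges even though `statute_b` sits far from the query in the vector space.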

Results

  • The system was tested on a dataset including:
    • Constitution: 265 sections
    • Statutes: 28,251 sections
    • Court of Appeals: 10,072 cases
    • Supreme Court: 5,727 cases
  • Data was decomposed hierarchically with NMFk.
  • Legal citations were collected using gpt-3.5-turbo.
  • The QA performance of the Smart-SLIC system was evaluated with several performance metrics, and retrieval methods operating on the embedding space were compared.
  • The system was evaluated through case studies, including constitutional, statutory, and case law analyses.
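
This summary does not say which metrics the paper used. As one possible way to compare retrieval methods, the sketch below computes two common ranking measures, recall at k and mean reciprocal rank. The choice of metrics, the method names, the document ids, and the relevance judgments are all assumptions made for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs, gold, k):
    """Average both metrics over all queries for each retrieval method."""
    report = {}
    for method, per_query in runs.items():
        recalls = [recall_at_k(per_query[q], gold[q], k) for q in gold]
        rranks = [reciprocal_rank(per_query[q], gold[q]) for q in gold]
        report[method] = {
            "recall_at_k": sum(recalls) / len(recalls),
            "mrr": sum(rranks) / len(rranks),
        }
    return report


# Hypothetical relevance judgments and ranked outputs of two retrieval methods.
gold = {"q1": {"d1", "d2"}, "q2": {"d5"}}
runs = {
    "method_a": {"q1": ["d1", "d3", "d2"], "q2": ["d4", "d5", "d6"]},
    "method_b": {"q1": ["d3", "d4", "d1"], "q2": ["d5", "d4", "d6"]},
}

report = evaluate(runs, gold, k=2)
```

With these made-up rankings, `method_a` scores 0.75 on both measures, while `method_b` reaches a recall at 2 of 0.5 and a mean reciprocal rank of about 0.67.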

Core Technologies

  • Vector Stores (VS): Embed legal texts into dense vector representations (e.g., BERT, GPT) to capture semantic meanings beyond keyword matching, implemented using Milvus.
  • Knowledge Graphs (KG): Formalize relationships between legal concepts (statutes, cases, doctrines), enabling structured navigation and explicit linking of legal authorities, using Neo4j.
  • Non-Negative Matrix Factorization (NMF): Uncover latent topics and patterns in unstructured text, factorizing word-embedding matrices into interpretable topics.
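
The factorization idea behind NMF can be sketched with the classic multiplicative update rules, which approximate a non-negative matrix X as the product W·H of two non-negative factors. The code below is a simplified, dependency-free illustration and not the T-ELF or NMFk implementation: the number of topics k is fixed by hand (NMFk selects it automatically, which this sketch does not attempt), and the 4×4 document-term matrix is invented.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def nmf(x, k, iters=200, seed=0, eps=1e-9):
    """Factor x (m x n) into w (m x k) and h (k x n) with multiplicative updates."""
    rng = random.Random(seed)
    m, n = len(x), len(x[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(m)]
    h = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)] for i in range(k)]
        # W <- W * (X H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(m)]
    return w, h


def frobenius_error(x, w, h):
    """Frobenius norm of the reconstruction residual x - w @ h."""
    approx = matmul(w, h)
    return sum((xv - av) ** 2 for xr, ar in zip(x, approx) for xv, av in zip(xr, ar)) ** 0.5


# Hypothetical document-term counts: rows are documents, columns are terms.
X = [
    [2, 4, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 1, 1],
]

W, H = nmf(X, k=2)
error = frobenius_error(X, W, H)
# Index of the largest entry in each row of W, read as that document's main topic.
dominant_topic = [max(range(2), key=row.__getitem__) for row in W]
```

Each row of `H` can be read as a topic (a weighting over terms) and each row of `W` as a document's mix of topics, which is what makes the factors interpretable. Because all entries stay non-negative, topics only add up and never cancel each other out.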

Importance

  • The system advances computational law by providing a scalable and interpretable method for retrieving and reasoning over complex legal corpora.
  • The experimental results show that chunking combined with hierarchical NMFk improves accuracy.
  • Future directions include refining the citation extraction pipeline and expanding the collection to encompass broader legal instruments.