
NitiBench: Evaluating LLMs on Thai Legal Question Answering

This blog post summarizes key findings from the NitiBench paper, which introduces a new benchmark for Thai legal question answering (QA). The study evaluates various Retrieval-Augmented Generation (RAG) and Long-Context LLM (LCLM) approaches, providing insights into their performance and limitations in the Thai legal domain.

Problem Definition

  • Challenge: Existing legal QA systems, particularly in Thai, lack standardized evaluation methods and struggle with the complexity of legal documents.
  • Need for Benchmarking: Benchmarks are crucial for evaluating legal QA systems, offering standardized tasks and metrics. While some benchmarks exist, they often focus on specific sub-tasks rather than free-form QA.
  • NitiBench Introduction: The paper introduces NitiBench to address these challenges, providing a resource for evaluating Thai legal QA systems.

Proposed Solution

  • NitiBench Dataset: A new benchmark dataset for Thai legal QA, consisting of two datasets:
    • NitiBench-CCL: General Thai Financial Law, comprising 3,730 entries.
    • NitiBench-Tax: Real-world Thai Tax Law cases, consisting of 50 cases.
  • Task Definition: The task involves generating an accurate answer to a query and providing cited legal sections.
  • Hierarchy-aware Chunking: Segments documents along legal section boundaries to respect the hierarchical structure of Thai legal codes (see the sketch after this list).
  • NitiLink: Augments retrieved sections with the additional sections they reference, addressing inter-section dependencies (also sketched below).
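
The paper does not reproduce its chunking code here; the following is a minimal sketch of hierarchy-aware chunking, assuming the statutes are available as plain text and that each legal section begins with the marker "มาตรา" followed by a number. Names such as SectionChunk and chunk_by_section are illustrative, not taken from the paper.

```python
import re
from dataclasses import dataclass

# Thai statutes introduce each section with "มาตรา <number>".
SECTION_PATTERN = re.compile(r"(?=มาตรา\s*\d+)")

@dataclass
class SectionChunk:
    law_title: str      # act or code name (hierarchy metadata)
    section_label: str  # e.g. "มาตรา 65"
    text: str           # full text of the section

def chunk_by_section(law_title: str, document: str) -> list[SectionChunk]:
    """Split a statute into one chunk per legal section instead of
    fixed-size windows, so a chunk never cuts a section in half."""
    chunks = []
    for piece in SECTION_PATTERN.split(document):
        piece = piece.strip()
        if not piece:
            continue
        match = re.match(r"มาตรา\s*\d+", piece)
        label = match.group(0) if match else "preamble"
        chunks.append(SectionChunk(law_title, label, piece))
    return chunks
```

Keeping the section label as chunk metadata also makes it straightforward to compare the sections a system cites against the gold sections during evaluation.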
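NitiLink's exact augmentation procedure is specified in the paper; the rough sketch below assumes a precomputed map from each section to the sections it cites, reuses the SectionChunk type from the sketch above, and simply follows references for a fixed number of hops. The function name and the max_depth parameter are illustrative assumptions.

```python
def expand_with_references(
    retrieved: list[SectionChunk],
    reference_map: dict[str, list[SectionChunk]],
    max_depth: int = 1,
) -> list[SectionChunk]:
    """Follow inter-section references: for each retrieved section, pull in
    the sections it cites so the generator sees the full legal context,
    up to max_depth hops."""
    seen = {c.section_label for c in retrieved}
    result = list(retrieved)
    frontier = list(retrieved)
    for _ in range(max_depth):
        next_frontier = []
        for chunk in frontier:
            for ref in reference_map.get(chunk.section_label, []):
                if ref.section_label not in seen:
                    seen.add(ref.section_label)
                    result.append(ref)
                    next_frontier.append(ref)
        frontier = next_frontier
    return result
```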

Results

  • Hierarchy-aware Chunking: Achieves a slight but consistent advantage over naive chunking strategies.
  • NitiLink: Does not show a clear or significant advantage when added to a RAG system.
    • NitiBench-Tax: Recall improved, but some end-to-end metrics declined.
    • NitiBench-CCL: No significant change in retrieval or end-to-end metrics.
  • Retriever Performance:
    • NitiBench-CCL: Human-reranked fine-tuned BGE-M3 performs best.
    • NitiBench-Tax: Overall retrieval performance is significantly lower than on NitiBench-CCL.
  • LLM Performance:
    • Claude-3.5-sonnet excels in Coverage, Contradiction, and end-to-end Recall across both datasets.
  • Long-Context LLM (LCLM): Performs comparably to the parametric setting on NitiBench-Tax and to the Naive RAG system on NitiBench-CCL.

Importance

  • Standardized Evaluation: NitiBench provides a standardized benchmark for evaluating Thai legal QA systems, addressing a critical gap in the field.
  • Insights into RAG and LCLM: The study offers valuable insights into the performance of RAG and LCLM approaches in the Thai legal domain.
  • Identified Challenges: The error analysis highlights specific challenges, such as hidden hierarchical information, nested structures, easily missed details, and complex queries.
  • Multi-label Metrics: The paper proposes multi-label retrieval metrics (Multi-MRR and Multi-Hit Rate) that correlate more strongly with end-to-end metrics than conventional single-label metrics do (a sketch of one possible formulation follows).
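
The paper defines these metrics formally; the sketch below shows one plausible reading for a query with several gold sections: Multi-Hit Rate as the fraction of gold sections found in the top-k results, and Multi-MRR as the reciprocal rank averaged over every gold section. Treat the top-k cutoff and the averaging choices as assumptions rather than the paper's exact formulas.

```python
def multi_hit_rate(gold: set[str], retrieved: list[str], k: int = 5) -> float:
    """Fraction of gold sections that appear in the top-k retrieved sections."""
    if not gold:
        return 0.0
    return len(gold & set(retrieved[:k])) / len(gold)

def multi_mrr(gold: set[str], retrieved: list[str], k: int = 5) -> float:
    """Reciprocal rank averaged over all gold sections; a gold section not
    retrieved within the top-k contributes 0."""
    if not gold:
        return 0.0
    scores = []
    for section in gold:
        try:
            rank = retrieved[:k].index(section) + 1
            scores.append(1.0 / rank)
        except ValueError:
            scores.append(0.0)
    return sum(scores) / len(gold)
```

Per-query scores would then be averaged over the benchmark to yield a system-level number.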