
NitiBench: Evaluating LLMs on Thai Legal Question Answering

This blog post summarizes key findings from the NitiBench paper, which introduces a new benchmark for Thai legal question answering (QA). The study evaluates various Retrieval-Augmented Generation (RAG) and Long-Context LLM (LCLM) approaches, providing insights into their performance and limitations in the Thai legal domain.

Problem Definition

  • Challenge: Existing legal QA systems, particularly in Thai, lack standardized evaluation methods and struggle with the complexity of legal documents.
  • Need for Benchmarking: Benchmarks are crucial for evaluating legal QA systems, offering standardized tasks and metrics. While some benchmarks exist, they often focus on specific sub-tasks rather than free-form QA.
  • NitiBench Introduction: The paper introduces NitiBench to address these challenges, providing a resource for evaluating Thai legal QA systems.

Proposed Solution

  • NitiBench Dataset: A new benchmark dataset for Thai legal QA, consisting of two datasets:
    • NitiBench-CCL: General Thai Financial Law, comprising 3,730 entries.
    • NitiBench-Tax: Real-world Thai Tax Law cases, consisting of 50 cases.
  • Task Definition: The task involves generating an accurate answer to a query and providing cited legal sections.
  • Hierarchy-aware Chunking: Segments documents along legal section boundaries to respect the hierarchical structure of Thai legal codes (see the sketch after this list).
  • NitiLink: Augments retrieved sections with the additional sections they reference, addressing inter-section dependencies (also sketched below).
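
The paper does not reproduce its chunking code here; the following is a minimal sketch of hierarchy-aware chunking, assuming the statutes are available as plain text and that each legal section begins with the marker "มาตรา" followed by a number. Names such as SectionChunk and chunk_by_section are illustrative, not taken from the paper.

```python
import re
from dataclasses import dataclass

# Thai statutes introduce each section with "มาตรา <number>".
SECTION_PATTERN = re.compile(r"(?=มาตรา\s*\d+)")

@dataclass
class SectionChunk:
    law_title: str      # act or code name (hierarchy metadata)
    section_label: str  # e.g. "มาตรา 65"
    text: str           # full text of the section

def chunk_by_section(law_title: str, document: str) -> list[SectionChunk]:
    """Split a statute into one chunk per legal section instead of
    fixed-size windows, so a chunk never cuts a section in half."""
    chunks = []
    for piece in SECTION_PATTERN.split(document):
        piece = piece.strip()
        if not piece:
            continue
        match = re.match(r"มาตรา\s*\d+", piece)
        label = match.group(0) if match else "preamble"
        chunks.append(SectionChunk(law_title, label, piece))
    return chunks
```

Keeping the section label as chunk metadata also makes it straightforward to compare the sections a system cites against the gold sections during evaluation.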
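NitiLink's exact augmentation procedure is specified in the paper; the rough sketch below assumes a precomputed map from each section to the sections it cites, reuses the SectionChunk type from the sketch above, and simply follows references for a fixed number of hops. The function name and the max_depth parameter are illustrative assumptions.

```python
def expand_with_references(
    retrieved: list[SectionChunk],
    reference_map: dict[str, list[SectionChunk]],
    max_depth: int = 1,
) -> list[SectionChunk]:
    """Follow inter-section references: for each retrieved section, pull in
    the sections it cites so the generator sees the full legal context,
    up to max_depth hops."""
    seen = {c.section_label for c in retrieved}
    result = list(retrieved)
    frontier = list(retrieved)
    for _ in range(max_depth):
        next_frontier = []
        for chunk in frontier:
            for ref in reference_map.get(chunk.section_label, []):
                if ref.section_label not in seen:
                    seen.add(ref.section_label)
                    result.append(ref)
                    next_frontier.append(ref)
        frontier = next_frontier
    return result
```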

Results

  • Hierarchy-aware Chunking: Achieves a slight but consistent advantage over naive chunking strategies.
  • NitiLink: Does not show a clear or significant advantage when added to a RAG system.
    • NitiBench-Tax: Recall improved, but some end-to-end metrics declined.
    • NitiBench-CCL: No significant change in retrieval or end-to-end metrics.
  • Retriever Performance:
    • NitiBench-CCL: Human-reranked fine-tuned BGE-M3 performs best.
    • NitiBench-Tax: Overall retrieval performance is significantly lower than on NitiBench-CCL.
  • LLM Performance:
    • Claude-3.5-sonnet excels in Coverage, Contradiction, and end-to-end Recall across both datasets.
  • Long-Context LLM (LCLM): Performs comparably to the parametric setting on NitiBench-Tax and to the Naive RAG system on NitiBench-CCL.

Importance

  • Standardized Evaluation: NitiBench provides a standardized benchmark for evaluating Thai legal QA systems, addressing a critical gap in the field.
  • Insights into RAG and LCLM: The study offers valuable insights into the performance of RAG and LCLM approaches in the Thai legal domain.
  • Identified Challenges: The error analysis highlights specific challenges, such as hidden hierarchical information, nested structures, easily missed details, and complex queries.
  • Multi-label Metrics: The paper proposes multi-label retrieval metrics (Multi-MRR and Multi-Hit Rate) that correlate more strongly with end-to-end metrics than conventional single-label metrics do (a sketch of one possible formulation follows).
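
The paper defines these metrics formally; the sketch below shows one plausible reading for a query with several gold sections: Multi-Hit Rate as the fraction of gold sections found in the top-k results, and Multi-MRR as the reciprocal rank averaged over every gold section. Treat the top-k cutoff and the averaging choices as assumptions rather than the paper's exact formulas.

```python
def multi_hit_rate(gold: set[str], retrieved: list[str], k: int = 5) -> float:
    """Fraction of gold sections that appear in the top-k retrieved sections."""
    if not gold:
        return 0.0
    return len(gold & set(retrieved[:k])) / len(gold)

def multi_mrr(gold: set[str], retrieved: list[str], k: int = 5) -> float:
    """Reciprocal rank averaged over all gold sections; a gold section not
    retrieved within the top-k contributes 0."""
    if not gold:
        return 0.0
    scores = []
    for section in gold:
        try:
            rank = retrieved[:k].index(section) + 1
            scores.append(1.0 / rank)
        except ValueError:
            scores.append(0.0)
    return sum(scores) / len(gold)
```

Per-query scores would then be averaged over the benchmark to yield a system-level number.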