NitiBench: Evaluating LLMs on Thai Legal Question Answering
This blog post summarizes key findings from the NitiBench paper, which introduces a new benchmark for Thai legal question answering (QA). The study evaluates various Retrieval-Augmented Generation (RAG) and Long-Context LLM (LCLM) approaches, providing insights into their performance and limitations in the Thai legal domain.
Problem Definition
- Challenge: Existing legal QA systems, particularly in Thai, lack standardized evaluation methods and struggle with the complexity of legal documents.
- Need for Benchmarking: Benchmarks are crucial for evaluating legal QA systems, offering standardized tasks and metrics. While some benchmarks exist, they often focus on specific sub-tasks rather than free-form QA.
- NitiBench Introduction: The paper introduces NitiBench to address these challenges, providing a resource for evaluating Thai legal QA systems.
Proposed Solution
- NitiBench Dataset: A new benchmark for Thai legal QA, consisting of two datasets:
  - NitiBench-CCL: General Thai financial law, comprising 3,730 entries.
  - NitiBench-Tax: Real-world Thai tax law cases, comprising 50 cases.
- Task Definition: Given a query, the system must generate an accurate answer and cite the relevant legal sections.
- Hierarchy-aware Chunking: Segments documents by legal section to preserve the hierarchical structure of Thai legal documents (see the first sketch after this list).
- NitiLink: Augments the retrieved sections with the sections they reference, addressing inter-section references (see the second sketch after this list).
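To make the chunking strategy concrete, here is a minimal Python sketch, assuming hierarchy-aware chunking means splitting a law's text into one chunk per section and keeping the law name and section header as metadata. The regex heuristic, the `Chunk` dataclass, and the `chunk_by_section` name are illustrative assumptions, not the paper's actual implementation.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    law: str      # e.g., "Revenue Code"
    section: str  # e.g., "มาตรา 40" (Section 40)
    text: str     # full text of the section

def chunk_by_section(law_name: str, document: str) -> list[Chunk]:
    """Split a law's full text into one chunk per legal section."""
    # Hypothetical heuristic: start a new chunk at each line that begins
    # with a section header ("มาตรา <number>").
    parts = re.split(r"(?m)^(?=มาตรา\s*\d+)", document)
    chunks = []
    for part in (p.strip() for p in parts):
        if not part:
            continue
        m = re.match(r"มาตรา\s*\d+", part)
        section = m.group(0) if m else "preamble"
        chunks.append(Chunk(law=law_name, section=section, text=part))
    return chunks
```

Chunking at section boundaries keeps each retrieval unit self-contained and lets a retrieved chunk be cited directly by its section number.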
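NitiLink can be sketched just as simply, assuming it performs a single-hop expansion over a precomputed cross-reference map (the traversal depth is a design choice); the `references` dictionary and all identifiers below are hypothetical.

```python
def augment_with_references(
    retrieved: list[str],
    references: dict[str, list[str]],  # section id -> section ids it cites
) -> list[str]:
    """Append sections referenced by the retrieved ones, without duplicates."""
    augmented = list(retrieved)
    seen = set(retrieved)
    for section in retrieved:
        for ref in references.get(section, []):
            if ref not in seen:
                seen.add(ref)
                augmented.append(ref)
    return augmented

# Example: Section 65 cites Sections 65 bis and 65 ter, so both are
# pulled into the context even though the retriever missed them.
refs = {"s.65": ["s.65bis", "s.65ter"]}
print(augment_with_references(["s.65", "s.40"], refs))
# -> ['s.65', 's.40', 's.65bis', 's.65ter']
```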
Results
- Hierarchy-aware Chunking: Achieves a slight but consistent advantage over naive chunking strategies.
- NitiLink: Shows no clear, significant advantage in a RAG system:
  - NitiBench-Tax: Retrieval recall improved, but some end-to-end metrics declined.
  - NitiBench-CCL: No significant change in retrieval or end-to-end metrics.
- Retriever Performance:
  - NitiBench-CCL: Human-reranked, fine-tuned BGE-M3 performs best.
  - NitiBench-Tax: Overall retrieval performance is significantly lower than on NitiBench-CCL.
- LLM Performance:
  - Claude-3.5-sonnet excels in Coverage, Contradiction, and end-to-end Recall on both datasets.
  - Long-Context LLM (LCLM): Performs comparably to the parametric (no-retrieval) setting on NitiBench-Tax and to the naive RAG system on NitiBench-CCL.
Importance
- Standardized Evaluation: NitiBench provides a standardized benchmark for evaluating Thai legal QA systems, addressing a critical gap in the field.
- Insights into RAG and LCLM: The study offers valuable insights into the performance of RAG and LCLM approaches in the Thai legal domain.
- Identified Challenges: The error analysis highlights specific challenges, such as hidden hierarchical information, nested structures, easily overlooked details, and complex queries.
- Multi-label Metrics: The paper proposes multi-label retrieval metrics (Multi-MRR and Multi-Hit Rate) that correlate more strongly with end-to-end metrics than conventional single-label metrics do (see the sketch below).
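As an illustration, below is a minimal sketch of such multi-label metrics. The paper's exact formulas may differ; here Multi-Hit Rate@k is assumed to be the fraction of gold sections retrieved in the top k, and Multi-MRR@k the mean reciprocal rank over all gold sections, with missing sections contributing 0.

```python
def multi_hit_rate_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold sections that appear in the top-k retrieved list."""
    top_k = set(retrieved[:k])
    return len(gold & top_k) / len(gold)

def multi_mrr_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Mean reciprocal rank over gold sections; a missing section scores 0."""
    rank = {doc: i + 1 for i, doc in enumerate(retrieved[:k])}
    return sum(1.0 / rank[g] for g in gold if g in rank) / len(gold)

# Example: two gold sections, retrieved at ranks 1 and 3.
retrieved = ["s.40", "s.12", "s.65", "s.7"]
gold = {"s.40", "s.65"}
print(multi_hit_rate_at_k(retrieved, gold, k=3))  # 1.0
print(multi_mrr_at_k(retrieved, gold, k=3))       # (1/1 + 1/3) / 2 ≈ 0.667
```

Unlike single-label hit rate, which is satisfied by any one relevant document, these variants reward retrieving all cited sections, which matters because a legal answer often rests on several sections at once.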