arxiv:2603.01710

Legal RAG Bench: an end-to-end benchmark for legal RAG

Published on Mar 2 · Submitted by Umar Butler on Mar 3

Abstract

Legal RAG Bench evaluates legal retrieval-augmented generation systems using a comprehensive dataset and a full factorial analysis, revealing that information retrieval affects performance significantly more than language model capability does.

AI-generated summary

We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure. Both long-form answers and supporting passages are provided. As an evaluation methodology, Legal RAG Bench leverages a full factorial design and novel hierarchical error decomposition framework, enabling apples-to-apples comparisons of the contributions of retrieval and reasoning models in RAG. We evaluate three state-of-the-art embedding models (Isaacus' Kanon 2 Embedder, Google's Gemini Embedding 001, and OpenAI's Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro and GPT-5.2), finding that information retrieval is the primary driver of legal RAG performance, with LLMs exerting a more moderate effect on correctness and groundedness. Kanon 2 Embedder, in particular, had the largest positive impact on performance, improving average correctness by 17.5 points, groundedness by 4.5 points, and retrieval accuracy by 34 points. We observe that many errors attributed to hallucinations in legal RAG systems are in fact triggered by retrieval failures, concluding that retrieval sets the ceiling for the performance of many modern legal RAG systems. We document why and how we built Legal RAG Bench alongside the results of our evaluations. We also openly release our code and data to assist with reproduction of our findings.

Community

Paper author · Paper submitter

Hey Hugging Face,
This is Legal RAG Bench, the first benchmark for legal RAG systems to simultaneously evaluate hallucinations, retrieval failures, and reasoning errors.

The key takeaways of our benchmark are:

  1. Embedding models, not generative models, are the primary driver of RAG accuracy. Switching from a general-purpose embedder like OpenAI's Text Embedding 3 Large to a legal domain embedder like Kanon 2 Embedder can raise accuracy by ~19 points.
  2. Hallucinations are often triggered by retrieval failures. Fix your retrieval stack, and, in most cases, you end up fixing hallucinations.
  3. Once you have a solid legal retrieval engine, it doesn’t matter as much what generative model you use; GPT-5.2 and Gemini 3.1 Pro perform relatively similarly, with Gemini 3.1 Pro achieving slightly better accuracy than GPT-5.2 at the cost of more hallucinations.
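To make the apples-to-apples comparison concrete: the full factorial design pairs every embedder with every LLM, so a score difference can be attributed to one factor while averaging over the other. Below is a minimal, hypothetical sketch of that grid; the model names mirror those evaluated in the paper, but `evaluate` is a placeholder, not the actual harness.

```python
from itertools import product

# Factors in the full factorial design (3 embedders x 2 LLMs = 6 configurations).
EMBEDDERS = ["kanon-2-embedder", "gemini-embedding-001", "text-embedding-3-large"]
LLMS = ["gemini-3.1-pro", "gpt-5.2"]

def evaluate(embedder: str, llm: str) -> dict:
    """Placeholder: run the RAG pipeline with this pairing and score it."""
    return {"correctness": 0.0, "groundedness": 0.0, "retrieval_accuracy": 0.0}

# Every embedder is crossed with every LLM.
results = {(e, m): evaluate(e, m) for e, m in product(EMBEDDERS, LLMS)}

def marginal(embedder: str, metric: str) -> float:
    """Marginal effect of an embedder: its average score across all LLMs."""
    scores = [results[(embedder, m)][metric] for m in LLMS]
    return sum(scores) / len(scores)
```

Comparing `marginal(...)` values across embedders (and the analogous averages across LLMs) is what lets a factorial design separate the retrieval contribution from the generation contribution.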

These findings confirm what we already suspected: that information retrieval sets the ceiling on the accuracy of legal RAG systems. It doesn’t matter how smart you are; you aren’t going to magically know the penalty for speeding in California without access to an up-to-date copy of the California Vehicle Code. Even so, to our knowledge, we’re the first to actually show this empirically.
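The reason so many apparent hallucinations trace back to retrieval is a matter of ordering: you can only blame the LLM for an ungrounded answer once you have confirmed the evidence was actually in its context. A hypothetical sketch of that hierarchical check (function and parameter names are illustrative, not our exact error taxonomy):

```python
def classify_error(retrieved_ids: set, gold_ids: set,
                   answer_grounded: bool, answer_correct: bool) -> str:
    """Attribute a failure hierarchically: retrieval first, then generation."""
    if answer_correct:
        return "correct"
    if not (gold_ids & retrieved_ids):
        return "retrieval_failure"  # the LLM never saw the supporting passage
    if not answer_grounded:
        return "hallucination"      # evidence was retrieved but ignored or invented over
    return "reasoning_error"        # grounded in the passage, wrong conclusion
```

Under a decomposition like this, fixing the retrieval stack shrinks the pool of errors that ever reach the hallucination branch, which is consistent with takeaway 2 above.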

In the interests of transparency, we have not only detailed exactly how we built Legal RAG Bench in our paper, but we’ve also released all of our data openly here on Hugging Face in addition to an interactive data explorer on our blog showing the full results of all evaluated models on Legal RAG Bench.
