RAG Evaluation Guide

Overview

This guide explains how to run the RAG evaluation on the Legal RAG System to complete Optional Feature #3: RAG Evaluation.

Files Created

  1. evaluation_dataset.json - 20 legal questions with reference answers
  2. run_evaluation.py - Automated evaluation script
  3. This guide - Instructions for running evaluation

Quick Start

Prerequisites

  1. Vector index must be built first
  2. OpenAI API key required
  3. Cohere API key optional (but recommended for full feature testing)

Run Evaluation

# Basic evaluation (without reranking)
python run_evaluation.py --api-key YOUR_OPENAI_KEY

# Full evaluation (with reranking)
python run_evaluation.py --api-key YOUR_OPENAI_KEY --cohere-key YOUR_COHERE_KEY

# Quick test (first 5 questions only)
python run_evaluation.py --api-key YOUR_OPENAI_KEY --max-questions 5
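To avoid pasting keys into the shell each time, the same commands could be assembled from environment variables. This is a minimal sketch, not part of the shipped scripts: the variable names `OPENAI_API_KEY` and `COHERE_API_KEY` and the helper `build_eval_command` are assumptions; only the `--api-key`, `--cohere-key`, and `--max-questions` flags come from the commands above.

```python
import os
import sys


def build_eval_command(max_questions=None, env=None):
    """Build the argument list for run_evaluation.py.

    Keys are read from OPENAI_API_KEY / COHERE_API_KEY (assumed names).
    """
    env = os.environ if env is None else env
    openai_key = env.get("OPENAI_API_KEY")
    if not openai_key:
        raise RuntimeError("OPENAI_API_KEY is not set")

    cmd = [sys.executable, "run_evaluation.py", "--api-key", openai_key]

    # Reranking is only enabled when a Cohere key is available.
    cohere_key = env.get("COHERE_API_KEY")
    if cohere_key:
        cmd += ["--cohere-key", cohere_key]

    # Optional cap for quick test runs.
    if max_questions is not None:
        cmd += ["--max-questions", str(max_questions)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the evaluation.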

Output Files

The script generates two files:

  1. evaluation_results_TIMESTAMP.json - Detailed results with all scores
  2. evaluation_report_TIMESTAMP.md - Human-readable markdown report
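Because each run writes a new timestamped file, it can help to pick up the latest results programmatically. The sketch below assumes the files land in the working directory and selects by modification time, since the exact timestamp format is not specified here; both helper names are hypothetical.

```python
import json
from pathlib import Path


def latest_results_path(directory="."):
    """Return the most recently modified evaluation_results_*.json, or None."""
    candidates = list(Path(directory).glob("evaluation_results_*.json"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def load_latest_results(directory="."):
    """Load the newest results file as parsed JSON."""
    path = latest_results_path(directory)
    if path is None:
        raise FileNotFoundError(f"no evaluation_results_*.json found in {directory}")
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)
```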

Evaluation Dataset

Dataset Statistics

  • Total Questions: 20
  • Question Types:
    • Factual: 5 questions
    • Definitional: 8 questions
    • Conceptual: 7 questions
  • Difficulty Levels:
    • Easy: 5 questions
    • Medium: 11 questions
    • Hard: 4 questions
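
The breakdown above can be re-derived from the dataset as a sanity check. This is a sketch under assumptions: it treats the dataset as a list of objects with `type` and `difficulty` keys, which may not match the actual layout of evaluation_dataset.json, so adjust the key names to the real file.

```python
from collections import Counter


def summarize_dataset(questions):
    """Count questions by type and difficulty.

    `questions` is assumed to be a list of dicts carrying hypothetical
    "type" and "difficulty" keys.
    """
    return {
        "total": len(questions),
        "by_type": dict(Counter(q["type"] for q in questions)),
        "by_difficulty": dict(Counter(q["difficulty"] for q in questions)),
    }
```

If the counts do not add up to 20 in both groupings, some questions are probably missing a label.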

Sample Questions

  1. "What types of courts are mentioned in the legal documents?"
  2. "What is the burden of proof in civil cases?"
  3. "Explain the difference between civil and criminal cases."
  4. "What are common legal issues in employment cases?"
  5. "What is legal precedent and how does it work?"

Metrics Explained

Faithfulness Score

What it measures: Whether the generated answer is supported by the retrieved source documents.

Scale: 0.0 to 1.0

  • > 0.8: Excellent - Answer fully grounded in sources
  • 0.6 - 0.8: Good - Mostly supported by sources
  • < 0.6: Needs improvement - Contains unsupported claims

Example:

  • Question: "What is a motion to dismiss?"
  • Good faithfulness: Answer uses exact language from retrieved court documents
  • Poor faithfulness: Answer includes legal concepts not in the retrieved sources

Relevancy Score

What it measures: Whether the generated answer actually addresses the user's question.

Scale: 0.0 to 1.0

  • > 0.8: Excellent - Directly answers the question
  • 0.6 - 0.8: Good - Relevant but may be incomplete
  • < 0.6: Needs improvement - Off-topic or tangential

Example:

  • Question: "What is the appeals process?"
  • Good relevancy: Explains how appeals work step-by-step
  • Poor relevancy: Discusses trial courts instead of appeals
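
Both metrics share the same three bands, so one small helper can label either score. The function name `score_band` is illustrative and not part of run_evaluation.py; a score of exactly 0.8 is placed in the "Good" band, following the "> 0.8" wording above.

```python
def score_band(score):
    """Map a faithfulness or relevancy score (0.0 to 1.0) to its band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score > 0.8:
        return "Excellent"
    if score >= 0.6:
        return "Good"
    return "Needs improvement"
```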

Expected Results

Based on the Legal RAG System configuration, you should expect:

Overall Performance

| Metric | Expected Range | Target |
|--------|----------------|--------|
| Average Faithfulness | 0.70 - 0.85 | > 0.75 |
| Average Relevancy | 0.75 - 0.90 | > 0.80 |
| Success Rate | 95% - 100% | 100% |

By Difficulty

| Difficulty | Expected Faithfulness | Expected Relevancy |
|------------|-----------------------|--------------------|
| Easy | 0.80 - 0.90 | 0.85 - 0.95 |
| Medium | 0.70 - 0.85 | 0.75 - 0.90 |
| Hard | 0.65 - 0.80 | 0.70 - 0.85 |

By Query Type

| Type | Expected Faithfulness | Expected Relevancy |
|------|-----------------------|--------------------|
| Factual | 0.75 - 0.90 | 0.80 - 0.95 |
| Definitional | 0.70 - 0.85 | 0.75 - 0.90 |
| Conceptual | 0.65 - 0.80 | 0.70 - 0.85 |
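
To compare a run against these ranges, per-question scores can be averaged by difficulty or by query type. The sketch below assumes each result record is a dict with `faithfulness`, `relevancy`, and a grouping key such as `difficulty` or `type`; these field names are guesses about the results JSON, not its confirmed schema.

```python
from collections import defaultdict
from statistics import mean


def average_by(results, key):
    """Average faithfulness and relevancy per value of `key`.

    `results` is assumed to be a list of dicts with "faithfulness",
    "relevancy", and the grouping key (for example "difficulty").
    """
    groups = defaultdict(list)
    for record in results:
        groups[record[key]].append(record)
    return {
        name: {
            "faithfulness": round(mean(r["faithfulness"] for r in rows), 3),
            "relevancy": round(mean(r["relevancy"] for r in rows), 3),
            "count": len(rows),
        }
        for name, rows in groups.items()
    }
```

Calling `average_by(results, "difficulty")` and `average_by(results, "type")` gives the two breakdowns shown in the tables above.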

Interpreting Results

Good Performance Indicators

  • ✅ Faithfulness > 0.75: Answers well-grounded in legal documents
  • ✅ Relevancy > 0.80: Answers directly address legal questions
  • ✅ Consistent performance across difficulty levels
  • ✅ High scores on factual and definitional questions

Areas That May Need Tuning

  • ⚠️ Faithfulness < 0.65: May need better chunking or retrieval
  • ⚠️ Relevancy < 0.70: May need prompt optimization
  • ⚠️ Large variance between difficulty levels
  • ⚠️ Low scores on conceptual questions
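
These warning thresholds can be checked automatically once averages are available. In the sketch below, the function `tuning_flags` is hypothetical, and the `max_gap` default of 0.15 is an arbitrary placeholder for "large variance", since this guide does not give a number for it.

```python
def tuning_flags(avg_faithfulness, avg_relevancy,
                 faithfulness_by_difficulty=None, max_gap=0.15):
    """Return a list of warnings based on the thresholds in this guide.

    max_gap is an assumed cutoff for the spread between the best and
    worst difficulty level; tune it to your own tolerance.
    """
    flags = []
    if avg_faithfulness < 0.65:
        flags.append("Faithfulness < 0.65: review chunking or retrieval")
    if avg_relevancy < 0.70:
        flags.append("Relevancy < 0.70: review the prompt")
    if faithfulness_by_difficulty:
        values = list(faithfulness_by_difficulty.values())
        if max(values) - min(values) > max_gap:
            flags.append("Large faithfulness gap between difficulty levels")
    return flags
```

An empty list means none of the warning conditions were triggered.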

Troubleshooting

"No index found" Error

Solution:

# Run the main application first
python legal_rag_system_deployment.py

# Then use the UI to build the index
# Navigate to Setup tab → Enter API keys → Click "Build Index"

Low Faithfulness Scores

Possible causes:

  • Chunk size too small (documents fragmented)
  • Not retrieving enough sources
  • LLM generating information not in sources

Solutions:

  • Increase CHUNK_SIZE in Config (try 768 or 1024), then rebuild the index so the new chunks are embedded
  • Increase SIMILARITY_TOP_K (try 7 or 10)
  • Modify prompt to emphasize "only use provided sources"
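
As an illustration of the last point, a prompt can state the grounding rule explicitly and number the sources so the model can cite them. This is a sketch only: `build_grounded_prompt` and its wording are assumptions, not the prompt the Legal RAG System actually uses.

```python
def build_grounded_prompt(question, sources):
    """Build a prompt that restricts the answer to the given sources.

    `sources` is a list of retrieved text chunks (plain strings).
    """
    numbered = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(sources, start=1)
    )
    return (
        "Answer the question using ONLY the numbered sources below. "
        "Cite the source number for each claim. If the sources do not "
        "contain the answer, say that you cannot answer from the "
        "provided documents.\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Giving the model an explicit way to decline tends to reduce unsupported claims, at the cost of more "cannot answer" responses when retrieval misses.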

Low Relevancy Scores

Possible causes:

  • Retrieved documents not relevant to query
  • LLM not focusing on the question

Solutions:

  • Enable reranking (use Cohere API key)
  • Switch to a stronger embedding model (try text-embedding-3-large), then rebuild the index
  • Add few-shot examples to prompt

API Rate Limits

If you hit OpenAI rate limits:

# Add delays between questions
import time
time.sleep(2)  # Wait 2 seconds between evaluations

Or run with --max-questions 5 to test with fewer questions.
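
A fixed delay slows every question, including the ones that would have succeeded. An alternative is to retry only on failure with exponentially growing waits. The helper below is a generic sketch, not code from run_evaluation.py: `call_with_backoff` is a hypothetical name, and `retry_on` should be narrowed to the rate-limit exception your client library raises.

```python
import time


def call_with_backoff(fn, max_attempts=5, base_delay=2.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying with exponential backoff on failure.

    Waits base_delay, then 2x, 4x, ... between attempts and re-raises
    the last error once max_attempts is reached.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Wrapping a single evaluation call, for example `call_with_backoff(lambda: evaluate(question))` where `evaluate` stands for whatever function scores one question, keeps the rest of the loop unchanged.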

Cost Estimation

For 20 Questions:

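| Component | Cost per Question | Total (20 questions) |
|-----------|-------------------|----------------------|
| Embeddings | ~$0.0001 | ~$0.002 |
| LLM Generation | $0.001 - $0.002 | $0.02 - $0.04 |
| Evaluation LLM | $0.001 - $0.002 | $0.02 - $0.04 |
| Reranking (optional) | $0.001 | $0.02 |
| TOTAL | ~$0.003 - $0.005 | ~$0.06 - $0.10 |

Full evaluation: about $0.10 or less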
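
The totals follow from summing the per-question line items and multiplying by the number of questions. The sketch below reproduces that arithmetic using the per-question figures from the table as (low, high) pairs; actual prices depend on the models used and current provider pricing.

```python
# Per-question cost estimates in USD as (low, high), taken from the table above.
PER_QUESTION_COST = {
    "embeddings": (0.0001, 0.0001),
    "llm_generation": (0.001, 0.002),
    "evaluation_llm": (0.001, 0.002),
    "reranking": (0.001, 0.001),
}


def estimate_total(num_questions, include_reranking=True):
    """Return the (low, high) estimated cost in USD for a run."""
    low = high = 0.0
    for name, (lo, hi) in PER_QUESTION_COST.items():
        if name == "reranking" and not include_reranking:
            continue
        low += lo
        high += hi
    return round(low * num_questions, 4), round(high * num_questions, 4)
```

For 20 questions with reranking this gives roughly $0.06 to $0.10, matching the table.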

Adding to README

After running evaluation, add this section to your README:

## RAG Evaluation Results

The system has been evaluated on 20 legal questions across different difficulty levels and query types.

### Overall Performance

| Metric | Score |
|--------|-------|
| Average Faithfulness | 0.XX |
| Average Relevancy | 0.XX |
| Success Rate | XX% |

### Key Findings

- ✅ Strong performance on [factual/definitional/conceptual] questions
- ✅ Consistent accuracy across difficulty levels
- ✅ Answers well-grounded in legal source documents

### Evaluation Dataset

- **Size**: 20 questions
- **Types**: Factual, Definitional, Conceptual
- **Difficulty**: Easy, Medium, Hard
- **Coverage**: Court systems, legal procedures, case law, legal concepts

**Full evaluation report**: See `evaluation_report_TIMESTAMP.md`
**Raw results**: See `evaluation_results_TIMESTAMP.json`
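
The placeholder scores in the template can be filled in from a run's averages. This is a small sketch with a hypothetical helper, `render_overall_table`; it assumes the success rate is given as a fraction between 0 and 1.

```python
def render_overall_table(avg_faithfulness, avg_relevancy, success_rate):
    """Render the Overall Performance table as Markdown.

    success_rate is a fraction (1.0 means every question succeeded).
    """
    rows = [
        ("Average Faithfulness", f"{avg_faithfulness:.2f}"),
        ("Average Relevancy", f"{avg_relevancy:.2f}"),
        ("Success Rate", f"{success_rate:.0%}"),
    ]
    lines = ["| Metric | Score |", "|--------|-------|"]
    lines += [f"| {name} | {value} |" for name, value in rows]
    return "\n".join(lines)
```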

Next Steps

  1. Run the evaluation with your API keys
  2. Review the markdown report for detailed results
  3. Add results to README as shown above
  4. Update deployment README with evaluation section
  5. Commit evaluation files to your repository:
    • evaluation_dataset.json
    • run_evaluation.py
    • evaluation_report_TIMESTAMP.md (latest)
    • evaluation_results_TIMESTAMP.json (latest)

Completion Checklist

  • Build vector index (if not already done)
  • Run evaluation with run_evaluation.py
  • Review generated markdown report
  • Verify average scores reach at least the bottom of the expected ranges (> 0.70 faithfulness, > 0.75 relevancy); the targets are > 0.75 and > 0.80
  • Add evaluation results section to README
  • Commit evaluation dataset and results to repo
  • Update project status: Feature #3 COMPLETE ✅

Congratulations! Once you complete this evaluation, you'll have 5 optional features implemented and be ready for certification! 🎉