RAG Evaluation Guide
Overview
This guide explains how to run the RAG evaluation on the Legal RAG System to complete Optional Feature #3: RAG Evaluation.
Files Created
- evaluation_dataset.json - 20 legal questions with reference answers
- run_evaluation.py - Automated evaluation script
- This guide - Instructions for running evaluation
Quick Start
Prerequisites
- Vector index must be built first
- OpenAI API key required
- Cohere API key optional (but recommended for full feature testing)
Run Evaluation
# Basic evaluation (without reranking)
python run_evaluation.py --api-key YOUR_OPENAI_KEY
# Full evaluation (with reranking)
python run_evaluation.py --api-key YOUR_OPENAI_KEY --cohere-key YOUR_COHERE_KEY
# Quick test (first 5 questions only)
python run_evaluation.py --api-key YOUR_OPENAI_KEY --max-questions 5
Output Files
The script generates two files:
- evaluation_results_TIMESTAMP.json - Detailed results with all scores
- evaluation_report_TIMESTAMP.md - Human-readable markdown report
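Because both output files carry a timestamp in their names, each run leaves a new pair behind. The sketch below shows one way to pick up the most recent pair. It assumes the files are written to the directory you pass in, which this guide does not confirm, and `latest_output` is a hypothetical helper rather than part of run_evaluation.py.

```python
from pathlib import Path
from typing import Optional


def latest_output(directory: str, pattern: str) -> Optional[Path]:
    """Return the most recently modified file matching pattern, or None."""
    matches = list(Path(directory).glob(pattern))
    if not matches:
        return None
    # Compare modification times so no timestamp format has to be assumed.
    return max(matches, key=lambda path: path.stat().st_mtime)


if __name__ == "__main__":
    # Assumption: outputs land in the current working directory.
    for pattern in ("evaluation_results_*.json", "evaluation_report_*.md"):
        print(pattern, "->", latest_output(".", pattern))
```

Sorting by modification time rather than by file name avoids guessing how TIMESTAMP is formatted.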
Evaluation Dataset
Dataset Statistics
- Total Questions: 20
- Question Types:
- Factual: 5 questions
- Definitional: 8 questions
- Conceptual: 7 questions
- Difficulty Levels:
- Easy: 5 questions
- Medium: 11 questions
- Hard: 4 questions
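To confirm these counts against the dataset file itself, a tally along the following lines may help. This is a sketch only: it assumes evaluation_dataset.json holds a JSON list of objects with `type` and `difficulty` keys, and that schema is a guess rather than something this guide documents.

```python
import json
from collections import Counter
from typing import Dict, List


def tally(questions: List[dict]) -> Dict[str, Counter]:
    """Count questions per type and per difficulty level."""
    return {
        "type": Counter(q["type"] for q in questions),              # assumed key
        "difficulty": Counter(q["difficulty"] for q in questions),  # assumed key
    }


def load_questions(path: str = "evaluation_dataset.json") -> List[dict]:
    """Load the dataset, assuming the file is a JSON list of question objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

If the schema matches, `tally(load_questions())` should report 5, 8, and 7 questions by type and 5, 11, and 4 by difficulty, each summing to 20.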
Sample Questions
- "What types of courts are mentioned in the legal documents?"
- "What is the burden of proof in civil cases?"
- "Explain the difference between civil and criminal cases."
- "What are common legal issues in employment cases?"
- "What is legal precedent and how does it work?"
Metrics Explained
Faithfulness Score
What it measures: Whether the generated answer is supported by the retrieved source documents.
Scale: 0.0 to 1.0
- > 0.8: Excellent - Answer fully grounded in sources
- 0.6 - 0.8: Good - Mostly supported by sources
- < 0.6: Needs improvement - Contains unsupported claims
Example:
- Question: "What is a motion to dismiss?"
- Good faithfulness: Answer uses exact language from retrieved court documents
- Poor faithfulness: Answer includes legal concepts not in the retrieved sources
Relevancy Score
What it measures: Whether the generated answer actually addresses the user's question.
Scale: 0.0 to 1.0
- > 0.8: Excellent - Directly answers the question
- 0.6 - 0.8: Good - Relevant but may be incomplete
- < 0.6: Needs improvement - Off-topic or tangential
Example:
- Question: "What is the appeals process?"
- Good relevancy: Explains how appeals work step-by-step
- Poor relevancy: Discusses trial courts instead of appeals
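Both metrics share the same three bands, so a single function can label either score. The sketch below is illustrative. `score_band` is a hypothetical name, and placing a score of exactly 0.8 in the "Good" band is one reading of the thresholds above.

```python
def score_band(score: float) -> str:
    """Map a 0.0-1.0 faithfulness or relevancy score to the guide's bands."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score > 0.8:
        return "Excellent"
    if score >= 0.6:
        # Covers the 0.6 - 0.8 band, including both endpoints.
        return "Good"
    return "Needs improvement"
```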
Expected Results
Based on the Legal RAG System configuration, you should expect:
Overall Performance
| Metric | Expected Range | Target |
|---|---|---|
| Average Faithfulness | 0.70 - 0.85 | > 0.75 |
| Average Relevancy | 0.75 - 0.90 | > 0.80 |
| Success Rate | 95% - 100% | 100% |
By Difficulty
| Difficulty | Expected Faithfulness | Expected Relevancy |
|---|---|---|
| Easy | 0.80 - 0.90 | 0.85 - 0.95 |
| Medium | 0.70 - 0.85 | 0.75 - 0.90 |
| Hard | 0.65 - 0.80 | 0.70 - 0.85 |
By Query Type
| Type | Expected Faithfulness | Expected Relevancy |
|---|---|---|
| Factual | 0.75 - 0.90 | 0.80 - 0.95 |
| Definitional | 0.70 - 0.85 | 0.75 - 0.90 |
| Conceptual | 0.65 - 0.80 | 0.70 - 0.85 |
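The expected ranges can be checked mechanically once you have per-group averages. The sketch below copies the faithfulness ranges from the "By Difficulty" table. `compare_to_expected` is a hypothetical helper, and you supply the averages from your own run.

```python
from typing import Dict, Tuple

# Expected (low, high) faithfulness ranges copied from the "By Difficulty" table.
EXPECTED_FAITHFULNESS: Dict[str, Tuple[float, float]] = {
    "Easy": (0.80, 0.90),
    "Medium": (0.70, 0.85),
    "Hard": (0.65, 0.80),
}


def compare_to_expected(
    averages: Dict[str, float],
    expected: Dict[str, Tuple[float, float]],
) -> Dict[str, str]:
    """Label each group as 'below', 'within', 'above', or 'missing'."""
    verdicts = {}
    for group, (low, high) in expected.items():
        if group not in averages:
            verdicts[group] = "missing"
        elif averages[group] < low:
            verdicts[group] = "below"
        elif averages[group] > high:
            verdicts[group] = "above"
        else:
            verdicts[group] = "within"
    return verdicts
```

The same function works for relevancy or for the "By Query Type" table if you pass a different dictionary of ranges.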
Interpreting Results
Good Performance Indicators
- ✅ Faithfulness > 0.75: Answers well-grounded in legal documents
- ✅ Relevancy > 0.80: Answers directly address legal questions
- ✅ Consistent performance across difficulty levels
- ✅ High scores on factual and definitional questions
Areas That May Need Tuning
- ⚠️ Faithfulness < 0.65: May need better chunking or retrieval
- ⚠️ Relevancy < 0.70: May need prompt optimization
- ⚠️ Large variance between difficulty levels
- ⚠️ Low scores on conceptual questions
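Variance between difficulty levels is easier to judge with a number attached. The sketch below averages a score per group and reports the gap between the best and worst group. The record keys (`difficulty`, `faithfulness`) are assumptions about the results file, and the guide sets no threshold for what counts as "large", so the code only reports the gap.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def group_means(records: List[dict], group_key: str, score_key: str) -> Dict[str, float]:
    """Average score_key for each distinct value of group_key."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record[group_key]].append(record[score_key])
    return {group: mean(scores) for group, scores in buckets.items()}


def spread(means: Dict[str, float]) -> float:
    """Gap between the highest and lowest group average."""
    return max(means.values()) - min(means.values())
```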
Troubleshooting
"No index found" Error
Solution:
# Run the main application first
python legal_rag_system_deployment.py
# Then use the UI to build the index
# Navigate to Setup tab → Enter API keys → Click "Build Index"
Low Faithfulness Scores
Possible causes:
- Chunk size too small (documents fragmented)
- Not retrieving enough sources
- LLM generating information not in sources
Solutions:
- Increase CHUNK_SIZE in Config (try 768 or 1024)
- Increase SIMILARITY_TOP_K (try 7 or 10)
- Modify prompt to emphasize "only use provided sources"
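The three changes above might look like the following. The real Config in legal_rag_system_deployment.py is not shown in this guide, so the class here is a hypothetical stand-in that only borrows the two setting names, and the prompt wording is an illustration rather than the system's actual prompt.

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Hypothetical stand-in for the project's Config; only the names are from the guide."""
    CHUNK_SIZE: int = 768       # guide suggests trying 768 or 1024
    SIMILARITY_TOP_K: int = 7   # guide suggests trying 7 or 10


# Illustrative prompt that restricts the model to the retrieved sources.
GROUNDED_PROMPT = (
    "Answer the question using only the provided sources. "
    "If the sources do not contain the answer, say so.\n\n"
    "Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def build_prompt(context: str, question: str) -> str:
    """Fill the grounded prompt with retrieved text and the user's question."""
    return GROUNDED_PROMPT.format(context=context, question=question)
```

Change one setting at a time and rerun with --max-questions 5 so that any movement in faithfulness can be traced to a single change.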
Low Relevancy Scores
Possible causes:
- Retrieved documents not relevant to query
- LLM not focusing on the question
Solutions:
- Enable reranking (use Cohere API key)
- Improve embedding model (try text-embedding-3-large)
- Add few-shot examples to prompt
API Rate Limits
If you hit OpenAI rate limits:
# Add delays between questions
import time
time.sleep(2) # Wait 2 seconds between evaluations
Or run with --max-questions 5 to test with fewer questions.
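Where exactly the delay goes depends on how run_evaluation.py loops over questions, which this guide does not show. As a sketch under that caveat, the wrapper below pauses between questions but not before the first one. `evaluate_one` is a hypothetical callable standing in for whatever evaluates a single question.

```python
import time
from typing import Callable, Iterable, List, TypeVar

Q = TypeVar("Q")
R = TypeVar("R")


def evaluate_with_pause(
    questions: Iterable[Q],
    evaluate_one: Callable[[Q], R],
    pause_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Evaluate questions in order, sleeping between consecutive calls."""
    results = []
    for index, question in enumerate(questions):
        if index > 0:
            # Pause only between evaluations, never before the first one.
            sleep(pause_seconds)
        results.append(evaluate_one(question))
    return results
```

The `sleep` parameter exists so the pause can be replaced in tests; in normal use, leave it at the default.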
Cost Estimation
For 20 Questions:
| Component | Cost per Question | Total (20 questions) |
|---|---|---|
| Embeddings | ~$0.0001 | $0.002 |
| LLM Generation | $0.001 - $0.002 | $0.02 - $0.04 |
| Evaluation LLM | $0.001 - $0.002 | $0.02 - $0.04 |
| Reranking (optional) | $0.001 | $0.02 |
| TOTAL | ~$0.003 - $0.005 | ~$0.06 - $0.10 |
Full evaluation: roughly $0.10 or less ✅
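The totals follow from summing the per-question figures and multiplying by the number of questions. The sketch below reproduces that arithmetic using only the numbers in the table above. `total_cost` is a hypothetical helper.

```python
from typing import Dict, Tuple

# (low, high) cost per question in USD, copied from the table above.
PER_QUESTION: Dict[str, Tuple[float, float]] = {
    "Embeddings": (0.0001, 0.0001),
    "LLM Generation": (0.001, 0.002),
    "Evaluation LLM": (0.001, 0.002),
    "Reranking (optional)": (0.001, 0.001),
}


def total_cost(per_question: Dict[str, Tuple[float, float]], num_questions: int) -> Tuple[float, float]:
    """Return the (low, high) total cost for num_questions questions."""
    low = sum(low_cost for low_cost, _ in per_question.values()) * num_questions
    high = sum(high_cost for _, high_cost in per_question.values()) * num_questions
    return low, high


if __name__ == "__main__":
    low, high = total_cost(PER_QUESTION, 20)
    print(f"20 questions: ${low:.3f} to ${high:.3f}")
```

For 20 questions this gives about $0.062 to $0.102, and dropping the optional reranking row lowers both ends by $0.02.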
Adding to README
After running evaluation, add this section to your README:
## RAG Evaluation Results
The system has been evaluated on 20 legal questions across different difficulty levels and query types.
### Overall Performance
| Metric | Score |
|--------|-------|
| Average Faithfulness | 0.XX |
| Average Relevancy | 0.XX |
| Success Rate | XX% |
### Key Findings
- ✅ Strong performance on [factual/definitional/conceptual] questions
- ✅ Consistent accuracy across difficulty levels
- ✅ Answers well-grounded in legal source documents
### Evaluation Dataset
- **Size**: 20 questions
- **Types**: Factual, Definitional, Conceptual
- **Difficulty**: Easy, Medium, Hard
- **Coverage**: Court systems, legal procedures, case law, legal concepts
**Full evaluation report**: See `evaluation_report_TIMESTAMP.md`
**Raw results**: See `evaluation_results_TIMESTAMP.json`
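Filling in the placeholders by hand is easy to get wrong, so the Overall Performance table can also be generated. The sketch below is a minimal example. `render_overall_table` is a hypothetical helper, and it assumes you pass the success rate as a fraction between 0 and 1.

```python
def render_overall_table(faithfulness: float, relevancy: float, success_rate: float) -> str:
    """Build the Overall Performance table; success_rate is a fraction in [0, 1]."""
    rows = [
        "| Metric | Score |",
        "|--------|-------|",
        f"| Average Faithfulness | {faithfulness:.2f} |",
        f"| Average Relevancy | {relevancy:.2f} |",
        f"| Success Rate | {success_rate:.0%} |",
    ]
    return "\n".join(rows)
```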
Next Steps
- Run the evaluation with your API keys
- Review the markdown report for detailed results
- Add results to README as shown above
- Update deployment README with evaluation section
- Commit evaluation files to your repository:
- evaluation_dataset.json
- run_evaluation.py
- evaluation_report_TIMESTAMP.md (latest)
- evaluation_results_TIMESTAMP.json (latest)
Completion Checklist
- Build vector index (if not already done)
- Run evaluation with run_evaluation.py
- Review generated markdown report
- Verify scores reach at least the bottom of the expected ranges (> 0.70 faithfulness, > 0.75 relevancy); the targets are > 0.75 and > 0.80
- Add evaluation results section to README
- Commit evaluation dataset and results to repo
- Update project status: Feature #3 COMPLETE ✅
Congratulations! Once you complete this evaluation, you'll have 5 optional features implemented and be ready for certification! 🎉