# Basic evaluation (without reranking)
python run_evaluation.py --api-key YOUR_OPENAI_KEY

# Full evaluation (with reranking)
python run_evaluation.py --api-key YOUR_OPENAI_KEY --cohere-key YOUR_COHERE_KEY

# Quick test (first 5 questions only)
python run_evaluation.py --api-key YOUR_OPENAI_KEY --max-questions 5

Output Files

The script generates two files:

evaluation_results_TIMESTAMP.json - Detailed results with all scores
evaluation_report_TIMESTAMP.md - Human-readable markdown report
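
Because both filenames embed a timestamp, repeated runs leave several files behind. The following is a minimal sketch for picking the most recent one. It assumes the files are written to a directory you can list, and it uses file modification time rather than parsing TIMESTAMP, since the timestamp format is not specified here. The helper name `latest_file` is hypothetical and not part of `run_evaluation.py`.

```python
from pathlib import Path
from typing import Optional


def latest_file(directory: str, pattern: str) -> Optional[Path]:
    """Return the most recently modified file matching pattern, or None."""
    candidates = list(Path(directory).glob(pattern))
    if not candidates:
        return None
    # Compare modification times; the TIMESTAMP format itself is not parsed.
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    print(latest_file(".", "evaluation_results_*.json"))
    print(latest_file(".", "evaluation_report_*.md"))
```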

Evaluation Dataset

Dataset Statistics

Total Questions: 20
Question Types:
- Factual: 5 questions
- Definitional: 8 questions
- Conceptual: 7 questions
Difficulty Levels:
- Easy: 5 questions
- Medium: 11 questions
- Hard: 4 questions
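
If you edit the dataset, the counts above can be rechecked with a short tally. This is a sketch under assumptions: it supposes `evaluation_dataset.json` is a top-level JSON list of question objects and that each object has `type` and `difficulty` fields. Those field names are hypothetical, so adjust them to match the actual file.

```python
import json
from collections import Counter
from typing import Dict, List


def load_questions(path: str) -> List[dict]:
    """Load the dataset; assumes a top-level JSON list of question objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize(questions: List[dict]) -> Dict[str, Counter]:
    """Count questions per type and per difficulty level.

    The keys "type" and "difficulty" are assumed field names.
    """
    return {
        "type": Counter(q["type"] for q in questions),
        "difficulty": Counter(q["difficulty"] for q in questions),
    }
```

For example, `summarize(load_questions("evaluation_dataset.json"))` returns two counters whose values should each sum to the total number of questions.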

Sample Questions

"What types of courts are mentioned in the legal documents?"
"What is the burden of proof in civil cases?"
"Explain the difference between civil and criminal cases."
"What are common legal issues in employment cases?"
"What is legal precedent and how does it work?"

Metrics Explained

Faithfulness Score

What it measures: Whether the generated answer is supported by the retrieved source documents.

Scale: 0.0 to 1.0

> 0.8: Excellent - Answer fully grounded in sources
0.6 - 0.8: Good - Mostly supported by sources
< 0.6: Needs improvement - Contains unsupported claims

Example:

Question: "What is a motion to dismiss?"
Good faithfulness: Answer uses exact language from retrieved court documents
Poor faithfulness: Answer includes legal concepts not in the retrieved sources

Relevancy Score

What it measures: Whether the generated answer actually addresses the user's question.

Scale: 0.0 to 1.0

> 0.8: Excellent - Directly answers the question
0.6 - 0.8: Good - Relevant but may be incomplete
< 0.6: Needs improvement - Off-topic or tangential

Example:

Question: "What is the appeals process?"
Good relevancy: Explains how appeals work step-by-step
Poor relevancy: Discusses trial courts instead of appeals
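
Faithfulness and relevancy use the same three bands, so one helper can label either score. This is a small sketch; the function name `score_band` is hypothetical, and scores of exactly 0.6 or 0.8 are placed in the "Good" band, following the ranges listed above.

```python
def score_band(score: float) -> str:
    """Map a 0.0-1.0 faithfulness or relevancy score to its band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score > 0.8:
        return "Excellent"
    if score >= 0.6:
        return "Good"
    return "Needs improvement"
```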

Expected Results

Based on the Legal RAG System configuration, you should expect:

Overall Performance

Metric	Expected Range	Target
Average Faithfulness	0.70 - 0.85	> 0.75
Average Relevancy	0.75 - 0.90	> 0.80
Success Rate	95% - 100%	100%

By Difficulty

Difficulty	Expected Faithfulness	Expected Relevancy
Easy	0.80 - 0.90	0.85 - 0.95
Medium	0.70 - 0.85	0.75 - 0.90
Hard	0.65 - 0.80	0.70 - 0.85

By Query Type

Type	Expected Faithfulness	Expected Relevancy
Factual	0.75 - 0.90	0.80 - 0.95
Definitional	0.70 - 0.85	0.75 - 0.90
Conceptual	0.65 - 0.80	0.70 - 0.85
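
To compare your own numbers with these tables, the expected ranges can be encoded as data. The sketch below covers the by-difficulty table only; the names `EXPECTED_BY_DIFFICULTY` and `below_expected` are hypothetical, and the per-difficulty averages are assumed to come from your own evaluation report.

```python
from typing import Dict, Tuple

# Expected (low, high) ranges copied from the "By Difficulty" table.
EXPECTED_BY_DIFFICULTY: Dict[str, Dict[str, Tuple[float, float]]] = {
    "Easy": {"faithfulness": (0.80, 0.90), "relevancy": (0.85, 0.95)},
    "Medium": {"faithfulness": (0.70, 0.85), "relevancy": (0.75, 0.90)},
    "Hard": {"faithfulness": (0.65, 0.80), "relevancy": (0.70, 0.85)},
}


def below_expected(difficulty: str, metric: str, score: float) -> bool:
    """Return True if score falls under the expected range's lower bound."""
    low, _high = EXPECTED_BY_DIFFICULTY[difficulty][metric]
    return score < low
```

A score above the upper bound is not flagged, since exceeding the expected range is not a problem.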

Interpreting Results

Good Performance Indicators

✅ Faithfulness > 0.75: Answers well-grounded in legal documents
✅ Relevancy > 0.80: Answers directly address legal questions
✅ Consistent performance across difficulty levels
✅ High scores on factual and definitional questions

Areas That May Need Tuning

⚠️ Faithfulness < 0.65: May need better chunking or retrieval
⚠️ Relevancy < 0.70: May need prompt optimization
⚠️ Large variance between difficulty levels
⚠️ Low scores on conceptual questions
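
The tuning triggers can be checked automatically once you have the averages. The sketch below is illustrative only: `tuning_warnings` is a hypothetical helper, and because no figure is given here for what counts as a "large" variance, the `max_spread` default of 0.15 is an assumed placeholder you should set yourself.

```python
from typing import Dict, List


def tuning_warnings(
    avg_faithfulness: float,
    avg_relevancy: float,
    faithfulness_by_difficulty: Dict[str, float],
    max_spread: float = 0.15,  # assumed threshold, not from the guide
) -> List[str]:
    """Return the warnings triggered by the given scores."""
    warnings: List[str] = []
    if avg_faithfulness < 0.65:
        warnings.append("Faithfulness < 0.65: review chunking or retrieval")
    if avg_relevancy < 0.70:
        warnings.append("Relevancy < 0.70: review the prompt")
    if faithfulness_by_difficulty:
        values = list(faithfulness_by_difficulty.values())
        # Spread is the gap between the best and worst difficulty level.
        if max(values) - min(values) > max_spread:
            warnings.append("Large variance between difficulty levels")
    return warnings
```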

Troubleshooting

"No index found" Error

Solution:

# Run the main application first
python legal_rag_system_deployment.py

# Then use the UI to build the index
# Navigate to Setup tab → Enter API keys → Click "Build Index"

Low Faithfulness Scores

Possible causes:

Chunk size too small (documents fragmented)
Not retrieving enough sources
LLM generating information not in sources

Solutions:

Increase CHUNK_SIZE in Config (try 768 or 1024)
Increase SIMILARITY_TOP_K (try 7 or 10)
Modify prompt to emphasize "only use provided sources"
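
The first two solutions combine into a small grid of settings to try. The sketch below enumerates that grid. The `Config` class here is a hypothetical stand-in for the project's own `Config`, and its defaults are placeholders rather than the project's real values. Only the names `CHUNK_SIZE` and `SIMILARITY_TOP_K` and the candidate values come from the list above. Chunking normally happens when the index is built, so expect to rebuild the index after changing `CHUNK_SIZE`.

```python
from dataclasses import dataclass, replace
from itertools import product
from typing import List


@dataclass(frozen=True)
class Config:
    # Hypothetical stand-in for the project's Config class.
    # These defaults are placeholders, not the project's real values.
    CHUNK_SIZE: int = 512
    SIMILARITY_TOP_K: int = 5


def candidate_configs(base: Config) -> List[Config]:
    """Return one Config per combination of the suggested values."""
    chunk_sizes = (768, 1024)
    top_ks = (7, 10)
    return [
        replace(base, CHUNK_SIZE=size, SIMILARITY_TOP_K=k)
        for size, k in product(chunk_sizes, top_ks)
    ]
```

Running the evaluation once per candidate and comparing average faithfulness shows which change, if any, actually helps.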

Low Relevancy Scores

Possible causes:

Retrieved documents not relevant to query
LLM not focusing on the question

Solutions:

Enable reranking (use Cohere API key)
Improve embedding model (try text-embedding-3-large)
Add few-shot examples to prompt

API Rate Limits

If you hit OpenAI rate limits:

# Add delays between questions
import time
time.sleep(2)  # Wait 2 seconds between evaluations

Or run with --max-questions 5 to test with fewer questions.
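
A fixed delay slows every question, including those that would have succeeded. An alternative is to retry only failed calls, waiting longer after each failure. The following is a generic sketch of exponential backoff and not code from `run_evaluation.py`. The helper `call_with_backoff` is hypothetical, and it catches every `Exception` for simplicity; in practice you would narrow this to the rate-limit error raised by your client library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on failure with delays of 2s, 4s, 8s, ..."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")
```

Wrap the per-question call, for example `call_with_backoff(lambda: evaluate(question))`, where `evaluate` stands for whatever function your script uses to evaluate a single question.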

Cost Estimation

For 20 Questions:

Component	Cost per Question	Total (20 questions)
Embeddings	~$0.0001	$0.002
LLM Generation	$0.001 - $0.002	$0.02 - $0.04
Evaluation LLM	$0.001 - $0.002	$0.02 - $0.04
Reranking (optional)	$0.001	$0.02
TOTAL	~$0.003 - $0.005	~$0.06 - $0.10

Full evaluation: about $0.10 or less ✅
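
The totals follow from summing the per-question ranges and multiplying by the number of questions. The sketch below reproduces that arithmetic so you can rescale it for a larger dataset. The per-question figures are the estimates from the table, not measured prices, and the function name `estimate_total` is hypothetical.

```python
from typing import Dict, Tuple

# (low, high) estimated cost per question in USD, copied from the table.
COST_PER_QUESTION: Dict[str, Tuple[float, float]] = {
    "Embeddings": (0.0001, 0.0001),
    "LLM Generation": (0.001, 0.002),
    "Evaluation LLM": (0.001, 0.002),
    "Reranking (optional)": (0.001, 0.001),
}


def estimate_total(
    num_questions: int, include_reranking: bool = True
) -> Tuple[float, float]:
    """Return the (low, high) estimated cost for num_questions."""
    low = high = 0.0
    for name, (lo, hi) in COST_PER_QUESTION.items():
        if not include_reranking and name.startswith("Reranking"):
            continue
        low += lo
        high += hi
    return low * num_questions, high * num_questions
```

For 20 questions with reranking this gives roughly $0.06 to $0.10, matching the TOTAL row.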

Adding to README

After running evaluation, add this section to your README:

## RAG Evaluation Results

The system has been evaluated on 20 legal questions across different difficulty levels and query types.

### Overall Performance

| Metric | Score |
|--------|-------|
| Average Faithfulness | 0.XX |
| Average Relevancy | 0.XX |
| Success Rate | XX% |

### Key Findings

- ✅ Strong performance on [factual/definitional/conceptual] questions
- ✅ Consistent accuracy across difficulty levels
- ✅ Answers well-grounded in legal source documents

### Evaluation Dataset

- **Size**: 20 questions
- **Types**: Factual, Definitional, Conceptual
- **Difficulty**: Easy, Medium, Hard
- **Coverage**: Court systems, legal procedures, case law, legal concepts

**Full evaluation report**: See `evaluation_report_TIMESTAMP.md`
**Raw results**: See `evaluation_results_TIMESTAMP.json`

Next Steps

Run the evaluation with your API keys
Review the markdown report for detailed results
Add results to README as shown above
Update deployment README with evaluation section
Commit evaluation files to your repository:
- evaluation_dataset.json
- run_evaluation.py
- evaluation_report_TIMESTAMP.md (latest)
- evaluation_results_TIMESTAMP.json (latest)

Completion Checklist

Build vector index (if not already done)
Run evaluation with run_evaluation.py
Review generated markdown report
Verify scores fall within the expected ranges (at least 0.70 faithfulness and 0.75 relevancy; the targets are > 0.75 and > 0.80)
Add evaluation results section to README
Commit evaluation dataset and results to repo
Update project status: Feature #3 COMPLETE ✅

Congratulations! Once you complete this evaluation, you'll have 5 optional features implemented and be ready for certification! 🎉