| # Mass Evaluations |
|
|
| Simple benchmark tool for running predefined prompts through all checkpoints of a model. |
|
|
| ## Usage |
|
|
| ```bash |
| python benchmark.py [model_name] [options] |
| ``` |
|
|
| ## Examples |
|
|
| ```bash |
| # Benchmark all checkpoints of a model |
| python benchmark.py pico-decoder-tiny-dolma5M-v1 |
| |
| # Specify custom output directory |
| python benchmark.py pico-decoder-tiny-dolma5M-v1 --output my_results/ |
| |
| # Use custom prompts file |
| python benchmark.py pico-decoder-tiny-dolma5M-v1 --prompts my_prompts.json |
| ``` |
|
|
| ## Managing Prompts |
|
|
| Prompts are stored in `prompts.json` as a simple array of strings: |
|
|
| ```json |
| [ |
| "Hello, how are you?", |
| "Complete this story: Once upon a time", |
| "What is the capital of France?" |
| ] |
| ``` |
|
|
| ### Adding New Prompts |
|
|
| Simply edit `prompts.json` and add new prompt strings to the array. Super simple! |
|
|
| ## Features |
|
|
| - **Auto-discovery**: Finds all `step_*` checkpoints automatically |
| - **JSON-based prompts**: Easily customizable prompts via JSON file |
| - **Readable output**: Markdown reports with clear structure |
| - **Error handling**: Continues on failures, logs errors |
| - **Progress tracking**: Shows real-time progress |
| - **Metadata logging**: Includes generation time and parameters |
|
|
| ## Output |
|
|
| Results are saved as markdown files in `results/` directory: |
| ``` |
| results/ |
| βββ pico-decoder-tiny-dolma5M-v1_benchmark_20250101_120000.md |
| βββ pico-decoder-tiny-dolma29k-v3_benchmark_20250101_130000.md |
| βββ ... |
| ``` |
|
|
| ## Predefined Prompts |
|
|
| 1. "Hello, how are you?" (conversational) |
| 2. "Complete this story: Once upon a time" (creative) |
| 3. "Explain quantum physics in simple terms" (explanatory) |
| 4. "Write a haiku about coding" (creative + structured) |
| 5. "What is the capital of France?" (factual) |
| 6. "The meaning of life is" (philosophical) |
| 7. "In the year 2050," (futuristic) |
| 8. "Python programming is" (technical) |
|
|