Reward Failure Dataset
213 structured encodings of RL reward configurations from 134 published papers (1983-2025) across 18 domains. Each entry encodes the reward structure as typed RewardSource objects with provenance, ground truth labels, and static analysis results from the goodhart tool.
Overview
- 135 documented failures and 78 well-designed rewards
- Every entry traces to a published paper with exact section/equation references
- Domains: manipulation, game AI, locomotion, driving, control, multi-agent, navigation, energy, finance, healthcare, RLHF, safety, chip design, fusion, industrial, and more
- Sources: built-in examples (66), Krakovna specification gaming catalog (27), Eureka GPT-4 rewards (27), independent papers (93)
Usage
from datasets import load_dataset
ds = load_dataset("audieleon/reward-failure-dataset")
# Filter to documented failures (entries where is_well_designed is False)
failures = ds["train"].filter(lambda x: not x["is_well_designed"])
# Filter by domain
driving = ds["train"].filter(lambda x: x["domain"] == "driving")
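Filtered rows can also be tallied by any string field. The following is a minimal sketch, not part of the dataset tooling: `count_by` is a hypothetical helper, and `sample_rows` holds made-up values standing in for rows of `ds["train"]` so the snippet runs without downloading anything.

```python
from collections import Counter


def count_by(rows, field):
    """Count how many rows share each value of `field` (hypothetical helper)."""
    return Counter(row[field] for row in rows)


# Hypothetical rows for illustration only; real rows come from ds["train"].
# The field names match the Fields table; the values are invented.
sample_rows = [
    {"domain": "driving", "is_well_designed": False, "detection_type": "structural"},
    {"domain": "driving", "is_well_designed": True, "detection_type": "structural"},
    {"domain": "locomotion", "is_well_designed": False, "detection_type": "dynamic"},
]

# Keep only documented failures, then tally them two ways.
failures = [row for row in sample_rows if not row["is_well_designed"]]
by_domain = count_by(failures, "domain")
by_detection = count_by(failures, "detection_type")

print(by_domain)
print(by_detection)
```

Iterating over a loaded split yields one dict per row, so passing `ds["train"]` in place of `sample_rows` should work the same way, assuming the fields are present as described.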
Fields
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique identifier |
| `source_paper` | string | Full citation |
| `paper_url` | string | DOI, arXiv, or official URL |
| `year` | int | Publication year |
| `domain` | string | Application domain |
| `encoding_basis` | string | `primary_source`, `code_derived`, `unverified_folklore`, or `tutorial` |
| `is_well_designed` | bool | True if the reward works as intended; False if it is a documented failure |
| `failure_mechanism` | string | Taxonomy category (`idle_exploit`, `shaping_loop`, `proxy_reward`, etc.) |
| `detection_type` | string | `structural` or `dynamic` |
| `brief_summary` | string | One-sentence summary of what happened |
| `documented_failure` | string | Detailed description of the failure |
| `environment` | object | EnvironmentModel parameters (`max_steps`, `gamma`, `actions`, `death_probability`) |
| `reward_sources` | array | Typed RewardSource objects with value, type, and behavioral flags |
| `analysis` | object | goodhart tool results (`num_criticals`, `num_warnings`, `rules_fired`) |
| `encoding_rationale` | string | JSON-encoded per-flag justification for encoding decisions |
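
Because `encoding_rationale` is stored as a JSON-encoded string, it has to be decoded before the per-flag justifications can be read. The sketch below assumes the string decodes to a JSON object and that `analysis` behaves like a dict with an integer `num_criticals`. The helper names and the sample record are hypothetical, and the values are invented for illustration.

```python
import json


def decode_rationale(record):
    """Parse the JSON-encoded `encoding_rationale` string (hypothetical helper).

    Returns an empty dict when the field is missing or empty.
    """
    raw = record.get("encoding_rationale") or "{}"
    return json.loads(raw)


def has_critical_findings(record):
    """True when the analysis reports at least one critical finding."""
    return record["analysis"]["num_criticals"] > 0


# Hypothetical record: field names follow the table above, values are made up.
# The list type of `rules_fired` is an assumption.
sample_record = {
    "id": "example-entry",
    "analysis": {"num_criticals": 1, "num_warnings": 2, "rules_fired": ["example_rule"]},
    "encoding_rationale": json.dumps({"example_flag": "hypothetical justification"}),
}

rationale = decode_rationale(sample_record)
for flag, reason in rationale.items():
    print(f"{flag}: {reason}")
print("has criticals:", has_critical_findings(sample_record))
```

Decoding on demand leaves the stored rows untouched, so the same helper can be applied inside a `filter` or `map` call if that is more convenient.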
Citation
@inproceedings{sheridan2026goodhart,
title={Catching Goodhart's Law Before Training: A Static Analyzer for Reinforcement Learning Reward Functions},
author={Sheridan, Audie Leon},
year={2026}
}
License
Apache 2.0