Reward Failure Dataset

213 structured encodings of RL reward configurations from 134 published papers (1983-2025) across 18 domains. Each entry encodes the reward structure as typed RewardSource objects with provenance, ground-truth labels, and static analysis results from the goodhart tool.

Overview

135 documented failures and 78 well-designed rewards
Every entry traces to a published paper with exact section/equation references
Domains: manipulation, game AI, locomotion, driving, control, multi-agent, navigation, energy, finance, healthcare, RLHF, safety, chip design, fusion, industrial, and more
Sources: built-in examples (66), Krakovna specification gaming catalog (27), Eureka GPT-4 rewards (27), independent papers (93)

Usage

from datasets import load_dataset

ds = load_dataset("audieleon/reward-failure-dataset")

# Keep only documented failures (is_well_designed is False)
failures = ds["train"].filter(lambda x: not x["is_well_designed"])

# Filter by domain
driving = ds["train"].filter(lambda x: x["domain"] == "driving")
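
As a further sketch, the same fields can be tallied to see which failure mechanisms appear in each domain. The helper name `count_mechanisms` and the sample rows are illustrative and not part of the dataset. Iterating a split such as `ds["train"]` yields one dict per row, so the function should accept a split directly.

```python
from collections import Counter


def count_mechanisms(rows):
    """Count (domain, failure_mechanism) pairs among documented failures."""
    counts = Counter()
    for row in rows:
        # Skip well-designed rewards; only documented failures are tallied.
        if not row["is_well_designed"]:
            counts[(row["domain"], row["failure_mechanism"])] += 1
    return counts


# Hypothetical rows standing in for ds["train"]; the values are made up.
# The empty failure_mechanism on the well-designed row is an assumption.
sample_rows = [
    {"domain": "driving", "is_well_designed": False, "failure_mechanism": "idle_exploit"},
    {"domain": "driving", "is_well_designed": False, "failure_mechanism": "idle_exploit"},
    {"domain": "locomotion", "is_well_designed": False, "failure_mechanism": "shaping_loop"},
    {"domain": "driving", "is_well_designed": True, "failure_mechanism": ""},
]

mechanism_counts = count_mechanisms(sample_rows)
for (domain, mechanism), n in mechanism_counts.most_common():
    print(f"{domain:12s} {mechanism:15s} {n}")
```

With the real data, replace `sample_rows` with `ds["train"]`.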

Fields

Field	Type	Description
`id`	string	Unique identifier
`source_paper`	string	Full citation
`paper_url`	string	DOI, arXiv, or official URL
`year`	int	Publication year
`domain`	string	Application domain
`encoding_basis`	string	primary_source, code_derived, unverified_folklore, or tutorial
`is_well_designed`	bool	True if reward works as intended; False if documented failure
`failure_mechanism`	string	Taxonomy category (idle_exploit, shaping_loop, proxy_reward, etc.)
`detection_type`	string	structural or dynamic
`brief_summary`	string	One-sentence summary of what happened
`documented_failure`	string	Detailed description of the failure
`environment`	object	EnvironmentModel parameters (max_steps, gamma, actions, death_probability)
`reward_sources`	array	Typed RewardSource objects with value, type, behavioral flags
`analysis`	object	goodhart tool results (num_criticals, num_warnings, rules_fired)
`encoding_rationale`	string	JSON-encoded per-flag justification for encoding decisions
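
Because `encoding_rationale` is stored as a JSON-encoded string rather than a nested object, it needs to be decoded before use. The sketch below does that and reads the `analysis` counts named in the table. The entry contents, the flag and rule names, and the helper `summarize_entry` are hypothetical, and the sketch assumes the decoded rationale is a mapping from flag name to justification.

```python
import json


def summarize_entry(entry):
    """Return a small summary dict for one dataset entry."""
    # encoding_rationale is a JSON string; decode it, tolerating an empty value.
    raw = entry["encoding_rationale"]
    rationale = json.loads(raw) if raw else {}
    analysis = entry["analysis"]
    return {
        "id": entry["id"],
        "criticals": analysis["num_criticals"],
        "warnings": analysis["num_warnings"],
        "rules_fired": list(analysis["rules_fired"]),
        # Assumes the decoded rationale is a dict keyed by flag name.
        "rationale_keys": sorted(rationale),
    }


# Hypothetical entry; every value here is invented for illustration only.
example_entry = {
    "id": "example_entry",
    "analysis": {
        "num_criticals": 1,
        "num_warnings": 2,
        "rules_fired": ["example_rule"],
    },
    "encoding_rationale": json.dumps({"example_flag": "why this flag was set"}),
}

summary = summarize_entry(example_entry)
print(summary)
```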

Citation

@inproceedings{sheridan2026goodhart,
  title={Catching Goodhart's Law Before Training: A Static Analyzer for Reinforcement Learning Reward Functions},
  author={Sheridan, Audie Leon},
  year={2026}
}

License

Apache 2.0

