Reward Failure Dataset

213 structured encodings of RL reward configurations from 134 published papers (1983-2025) across 18 domains. Each entry encodes the reward structure as typed RewardSource objects and includes provenance, ground-truth labels, and static-analysis results from the goodhart tool.

Overview

  • 135 documented failures and 78 well-designed rewards
  • Every entry traces to a published paper with exact section/equation references
  • Domains: manipulation, game AI, locomotion, driving, control, multi-agent, navigation, energy, finance, healthcare, RLHF, safety, chip design, fusion, industrial, and more
  • Sources: built-in examples (66), Krakovna specification gaming catalog (27), Eureka GPT-4 rewards (27), independent papers (93)

Usage

from datasets import load_dataset

ds = load_dataset("audieleon/reward-failure-dataset")

# Filter to documented failures (entries where is_well_designed is False)
failures = ds["train"].filter(lambda x: not x["is_well_designed"])

# Filter by domain
driving = ds["train"].filter(lambda x: x["domain"] == "driving")
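The filters above can also be combined into a single predicate, for example to keep only documented failures whose detection_type is structural. This is a minimal sketch that assumes the field names and values listed under Fields; the helper name is_structural_failure and the sample rows are hypothetical and are not part of the dataset.

```python
def is_structural_failure(row: dict) -> bool:
    """Return True for documented failures whose detection_type is 'structural'."""
    return (not row["is_well_designed"]) and row["detection_type"] == "structural"


# Hypothetical rows for illustration only; real rows come from ds["train"].
sample_rows = [
    {"id": "a", "is_well_designed": False, "detection_type": "structural"},
    {"id": "b", "is_well_designed": False, "detection_type": "dynamic"},
    {"id": "c", "is_well_designed": True, "detection_type": "structural"},
]

# Keep the ids of rows that satisfy both conditions.
selected = [row["id"] for row in sample_rows if is_structural_failure(row)]
print(selected)
```

With the dataset loaded as shown above, the same predicate can be passed directly to the filter call: ds["train"].filter(is_structural_failure).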

Fields

| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier |
| source_paper | string | Full citation |
| paper_url | string | DOI, arXiv, or official URL |
| year | int | Publication year |
| domain | string | Application domain |
| encoding_basis | string | primary_source, code_derived, unverified_folklore, or tutorial |
| is_well_designed | bool | True if reward works as intended; False if documented failure |
| failure_mechanism | string | Taxonomy category (idle_exploit, shaping_loop, proxy_reward, etc.) |
| detection_type | string | structural or dynamic |
| brief_summary | string | One-sentence summary of what happened |
| documented_failure | string | Detailed description of the failure |
| environment | object | EnvironmentModel parameters (max_steps, gamma, actions, death_probability) |
| reward_sources | array | Typed RewardSource objects with value, type, behavioral flags |
| analysis | object | goodhart tool results (num_criticals, num_warnings, rules_fired) |
| encoding_rationale | string | JSON-encoded per-flag justification for encoding decisions |
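Because encoding_rationale is stored as a JSON-encoded string, it has to be decoded before its per-flag entries can be read. The sketch below shows one way to do that and to tally entries by failure_mechanism. It assumes only the field names listed above; the helper decode_rationale, the key names inside the rationale, and all sample values are hypothetical.

```python
import json
from collections import Counter


def decode_rationale(row: dict) -> dict:
    """Parse the JSON string in encoding_rationale; empty or missing values give {}."""
    raw = row.get("encoding_rationale") or ""
    return json.loads(raw) if raw else {}


# Hypothetical records that mimic the documented fields; all values are invented.
records = [
    {"failure_mechanism": "idle_exploit",
     "encoding_rationale": '{"example_flag": "example justification"}'},
    {"failure_mechanism": "shaping_loop", "encoding_rationale": ""},
    {"failure_mechanism": "idle_exploit", "encoding_rationale": "{}"},
]

# Decode every rationale into a dict, then count entries per taxonomy category.
rationales = [decode_rationale(r) for r in records]
mechanism_counts = Counter(r["failure_mechanism"] for r in records)

print(rationales[0])
print(mechanism_counts.most_common())
```

The same functions work on rows from ds["train"], since iterating over a loaded split yields one dict per entry.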

Citation

@inproceedings{sheridan2026goodhart,
  title={Catching Goodhart's Law Before Training: A Static Analyzer for Reinforcement Learning Reward Functions},
  author={Sheridan, Audie Leon},
  year={2026}
}

License

Apache 2.0
