title: CyberSelfPlay (Cyber POSG)
emoji: 🛡️
colorFrom: blue
colorTo: red
sdk: docker
app_port: 7870
pinned: true
CyberSelfPlay: Autonomous Red-vs-Blue Cyber Defense Environment
Training script link: League (PFSP + PSRO), Colab (mixed)
Interactive game built on the environment: Game
CyberSelfPlay is an OpenEnv-compatible reinforcement learning environment for cyber defense. The setting is a partially observable, stochastic Red-vs-Blue contest where Blue must execute enterprise recovery playbooks while Red applies adversarial pressure.
Environment on Hugging Face Space
- Live Space (hub): CyberSelfPlay on Hugging Face
- Running app / API base: https://harshitshri026-cyberselfplay-env.hf.space
- Interactive API (Swagger): https://harshitshri026-cyberselfplay-env.hf.space/docs
- ReDoc: https://harshitshri026-cyberselfplay-env.hf.space/redoc
- Narrative, Colab context, and results figures: Blogs
Problem and Capability Gap
Most agent benchmarks are short and single-agent. Cyber defense in practice is multi-step, partially observable, adversarial, and stochastic. CyberSelfPlay targets that gap by coupling multi-step mission execution with attacker-defender interaction and structured tool actions.
Connection to long-horizon and self-play themes: the setting stresses (super) long-horizon planning and instruction following. Episodes span many steps and many playbook instructions, and security rewards are often sparse or delayed, so the agent must track state, recover from missteps, and keep coherent plans across long runs. It also supports self-improvement through interaction: the non-league recipes use SFT followed by GRPO, while league methods combine the same SFT initialization and per-round mini-GRPO steps with PFSP, PSRO, or mixed updates over Red archetypes or pools. Because opponents change from round to round, the LLM policy is tuned not on a static task set but on an evolving curriculum over the same family of tasks.
Environment Design
Two agents interact over a shared hidden state: Blue (defender) is the trainable side in most recipes; Red (attacker) can be scripted, drawn from a pool, or used as a league opponent. Time advances in discrete steps: each step takes one CyberAction and returns a player-specific CyberObservation, after which the caller alternates sides or follows its rollout script. The OpenEnv server exposes the same CyberAction / CyberObservation contract over HTTP for remote rollouts and demos.
What the agent observes
CyberObservation is built in cyber_selfplay_env/models.py and returned by CyberSelfPlayEnvironment.reset and step. It includes:
- public_state: a partial dict from CyberSimulator.visible_state(actor). For Blue, expect fields such as time_step, a window of detections, business_impact, a coarse known_incident_count, and instruction_progress with counts of completed, violated, and total mission instructions. Red's public view is different (e.g., limited known_targets, high_value_guess_count, detection_pressure), so the game is a true two-sided POSG with distinct observation channels.
- telemetry: a short list of event-like records (e.g., recent detections for Blue; a compact risk string for Red).
- incident_summary: episode-level fields including terminated, winner when set, exfil, and time_step (exact keys evolve with simulator state but stay consistent in the client).
- reward: scalar reward from the environment for the last transition.
- done: whether the episode has ended.
- metadata: on normal steps, includes reward_components, raw events, posg_metrics (aggregates like exfil and instruction completion rates), and a curriculum block (scenario name, rolling Blue win rate, episode index). The initial reset also carries scenario/actor hints. Invalid tool calls return metadata["error"] without terminating the run.
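As a rough illustration of how a Blue policy might read these fields, the sketch below works on a plain dict standing in for a serialized CyberObservation. The payload values, the `summarize_blue_observation` helper, and the key names inside `instruction_progress` (`completed`, `violated`, `total`) are assumptions made for the example; only the top-level field names come from the description above.

```python
from typing import Any, Dict


def summarize_blue_observation(obs: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the fields a Blue policy might need for its next decision."""
    public = obs.get("public_state", {})
    # Assumed key names for the completed / violated / total counts.
    progress = public.get("instruction_progress", {})
    total = progress.get("total", 0)
    completed = progress.get("completed", 0)
    metadata = obs.get("metadata", {})
    return {
        "time_step": public.get("time_step"),
        "completion_rate": completed / total if total else 0.0,
        "violations": progress.get("violated", 0),
        # Invalid tool calls surface here without ending the episode.
        "tool_error": metadata.get("error"),
        "done": bool(obs.get("done", False)),
    }


# Hypothetical payload shaped like the fields listed above.
example_obs = {
    "public_state": {
        "time_step": 12,
        "detections": [],
        "business_impact": 0.1,
        "known_incident_count": 2,
        "instruction_progress": {"completed": 10, "violated": 1, "total": 40},
    },
    "telemetry": [],
    "incident_summary": {"terminated": False},
    "reward": 0.25,
    "done": False,
    "metadata": {"error": "unknown tool"},
}

summary = summarize_blue_observation(example_obs)
print(summary)
```

Checking `tool_error` before `done` lets a rollout loop retry a malformed tool call instead of treating it as the end of the episode.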
What the agent does
Policies act through CyberAction: actor ("red" | "blue"), tool_name, optional target (host/asset id), params (tool-specific dict, e.g. for execute_instruction), and optional rationale. The environment validates the tool name against the allowed set for that side, then the CyberSimulator applies the effect.
Blue tools (defense and playbook): e.g. query_siem, triage_alerts, isolate_host, disable_account, rotate_secrets, deploy_patch, harden_policy, restore_backup, run_forensics, publish_ioc_blocklist, execute_instruction, checkpoint_plan, reconcile_state.
Red tools (attack chain): e.g. recon_network, enumerate_services, attempt_exploit, dump_credentials, pivot_host, establish_persistence, prepare_exfiltration, execute_exfiltration, cover_tracks, sabotage_recovery_plan.
(Authoritative sets live in cyber_selfplay_env/tools_blue.py and tools_red.py.)
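The validation step can be sketched as follows. `ActionSketch` is a hypothetical stand-in that mirrors the CyberAction fields named above, the tool sets are abbreviated subsets of the lists above, and `validate_action` is an illustrative helper, not the environment's own function.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Abbreviated subsets of the tool names listed above (not the authoritative sets).
BLUE_TOOLS = {"query_siem", "triage_alerts", "isolate_host",
              "execute_instruction", "checkpoint_plan"}
RED_TOOLS = {"recon_network", "attempt_exploit", "pivot_host",
             "execute_exfiltration"}


@dataclass
class ActionSketch:
    """Hypothetical stand-in with the same fields as CyberAction."""
    actor: str
    tool_name: str
    target: Optional[str] = None
    params: Dict[str, Any] = field(default_factory=dict)
    rationale: Optional[str] = None


def validate_action(action: ActionSketch) -> Optional[str]:
    """Return an error string for an invalid action, or None if it is allowed."""
    if action.actor not in ("red", "blue"):
        return f"unknown actor: {action.actor}"
    allowed = BLUE_TOOLS if action.actor == "blue" else RED_TOOLS
    if action.tool_name not in allowed:
        return f"tool {action.tool_name!r} is not allowed for {action.actor}"
    return None


ok = ActionSketch(actor="blue", tool_name="isolate_host", target="host-7",
                  rationale="contain lateral movement")
bad = ActionSketch(actor="blue", tool_name="pivot_host", target="host-7")

print(validate_action(ok))
print(validate_action(bad))
```

Returning an error string rather than raising matches the behavior described earlier, where an invalid tool call is reported in metadata and the episode continues.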
What the agent is rewarded for
Rewards combine security outcomes (detection, containment, recovery, exfiltration pressure) and mission outcomes (instruction progress, checkpoints, violations).
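One way to picture this combination is a weighted sum over named components. The weights and component names below are hypothetical placeholders chosen for the example; the actual rubric is defined by the environment's scoring logic.

```python
from typing import Dict

# Hypothetical weights, for illustration only.
SECURITY_WEIGHTS = {"detection": 1.0, "containment": 1.5,
                    "recovery": 1.0, "exfiltration": -3.0}
MISSION_WEIGHTS = {"instruction_progress": 0.5, "checkpoint": 0.25,
                   "violation": -1.0}


def blue_reward(components: Dict[str, float]) -> float:
    """Weighted sum of security and mission components for one Blue step."""
    weights = {**SECURITY_WEIGHTS, **MISSION_WEIGHTS}
    return sum(
        (weights[name] * value for name, value in components.items()
         if name in weights),
        0.0,
    )


# One detection, two instructions completed, one violation.
step_components = {"detection": 1.0, "instruction_progress": 2.0, "violation": 1.0}
print(blue_reward(step_components))
```

Keeping the components separate until the final sum makes it straightforward to log them individually, which is what the reward_components entry in the observation metadata is for.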
Formal game model
CyberSelfPlay is modeled as a two-player partially observable stochastic game (POSG). Each side optimizes its own objective through its own observation channel, and the Red and Blue rewards are coupled in a near-zero-sum way.
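As a reference point, a two-player POSG is commonly written as below. The notation is a standard textbook form and is an assumption here, not necessarily the project's exact definition.

$$
\mathcal{G} = \big\langle \mathcal{S},\ \{\mathcal{A}_i\},\ \{\mathcal{O}_i\},\ T,\ \{\Omega_i\},\ \{r_i\},\ \gamma \big\rangle, \qquad i \in \{R, B\}
$$

Here $\mathcal{S}$ is the hidden simulator state, $\mathcal{A}_i$ the tool actions of player $i$, $\mathcal{O}_i$ its observations, $T(s' \mid s, a)$ the stochastic transition, $\Omega_i(o_i \mid s')$ the observation function, $r_i$ the reward, and $\gamma \in (0, 1]$ a discount factor. Each policy $\pi_i$ maps the player's observation history to an action and maximizes

$$
J_i(\pi_R, \pi_B) = \mathbb{E}\left[ \sum_{t=0}^{H} \gamma^{t}\, r_i(s_t, a_t) \right]
$$

over a horizon of $H$ steps. A near-zero-sum coupling can then be expressed as

$$
r_B(s, a) + r_R(s, a) = \epsilon(s, a)
$$

where $\epsilon$ is a small residual collecting terms that only one side receives, for example Blue's mission-progress terms.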
Reward Structure
Red reward
Blue reward
The reward rubric is implemented directly in the environment's scoring logic.
Environment Architecture
Training Flow
Training Approaches in This Project
This project explores multiple training strategies for learning robust Blue policies in the CyberSelfPlay environment.
We experiment across SFT + GRPO baselines, reward smoothing, diversity shaping, and league-based RL where each round still relies on SFT-style warm-start and GRPO (mini-GRPO per round) in addition to PFSP / PSRO opponent scheduling.
Overview of Training Methods
| Method | Description | Colab | Metrics / Curves |
|---|---|---|---|
| 🔹 GRPO (Single-Policy RL) | | | |
| SFT → GRPO (Vanilla) | Baseline using only environment reward | Open | (figure) |
| SFT → GRPO (Anti-Collapse) | Adds diversity penalty to avoid mode collapse | Open | (figure) |
| 🔹 League (Multi-Policy RL) | | | |
| League (PFSP) | SFT, then per-round mini-GRPO; PFSP weights which Red-style opponent to sample | Open | (figure) |
| League (PSRO) | SFT, then per-round mini-GRPO; PSRO-style meta-updates on the opponent population | Open | (figure) |
| League (PFSP + PSRO) | SFT, then per-round mini-GRPO; PFSP sampling and PSRO replicator updates used together | Open | (figure) |
Mathematical Formulation
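1. GRPO (Vanilla)
Optimizes the policy on the environment reward alone.
2. GRPO + Regularization (Anti-Collapse)
Adds a diversity penalty to the objective to avoid mode collapse.
3. PFSP (Prioritized Fictitious Self-Play)
Weights which Red opponent to sample in each round.
4. PSRO (Policy-Space Response Oracles)
Applies meta-updates to the distribution over the opponent population.
5. PFSP + PSRO (Combined)
Combines opponent sampling (PFSP) with meta-policy updates (PSRO).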
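The two league ingredients can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's `pfsp.py` or `psro_meta.py`: the PFSP weighting uses the common $(1 - p)^k$ form, where $p$ is Blue's win rate against an opponent, and the meta-update is a multiplicative-weights discretization of replicator dynamics. The archetype names, `power`, and `lr` are hypothetical.

```python
import math
import random
from typing import Dict


def pfsp_weights(blue_win_rate: Dict[str, float], power: float = 2.0) -> Dict[str, float]:
    """Sampling weights that favor opponents Blue currently loses to."""
    raw = {name: (1.0 - p) ** power for name, p in blue_win_rate.items()}
    total = sum(raw.values())
    if total == 0.0:  # Blue beats every opponent: fall back to uniform sampling
        return {name: 1.0 / len(raw) for name in raw}
    return {name: w / total for name, w in raw.items()}


def replicator_step(meta: Dict[str, float], red_payoff: Dict[str, float],
                    lr: float = 0.5) -> Dict[str, float]:
    """Shift meta-strategy mass toward Red opponents with higher payoff."""
    scaled = {name: meta[name] * math.exp(lr * red_payoff[name]) for name in meta}
    total = sum(scaled.values())
    return {name: v / total for name, v in scaled.items()}


def sample_opponent(weights: Dict[str, float], rng: random.Random) -> str:
    """Draw one opponent name according to the given weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical Red archetypes and Blue win rates against each.
win_rate = {"stealthy": 0.8, "aggressive": 0.5, "saboteur": 0.2}

weights = pfsp_weights(win_rate)
uniform = {name: 1.0 / len(win_rate) for name in win_rate}
meta = replicator_step(uniform, {n: 1.0 - p for n, p in win_rate.items()})
opponent = sample_opponent(weights, random.Random(0))

print(weights)
print(meta)
print(opponent)
```

Both functions push training toward the opponents Blue handles worst. PFSP does so per sample from current win rates, while the replicator step accumulates the shift across rounds in a persistent meta-strategy.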
Core Optimization Math
SFT (Supervised Fine-Tuning)
Token-level cross-entropy (negative log-likelihood) on expert trajectories.
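In the usual notation, with $\mathcal{D}$ standing for the set of expert (prompt, completion) pairs, this loss takes the standard form below. The symbols are assumed here for illustration.

$$
\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \sum_{t=1}^{|y|} \log \pi_\theta\big(y_t \mid x, y_{<t}\big) \right]
$$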
GRPO (Group Relative Policy Optimization)
For prompt $x$, sample a group of completions $y^{(1)}, \dots, y^{(G)}$ from the current policy $\pi_\theta$. Score each completion with reward $R^{(j)}$, compute group-relative advantages, and update policy parameters with optional KL regularization toward a reference policy $\pi_{\text{ref}}$.
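The group-relative advantage step can be sketched as follows, assuming the commonly used mean-and-standard-deviation normalization within each group. The function name and `eps` value are illustrative, and the policy-gradient update and KL term are left out.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four completions sampled from the same prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because advantages are computed within a group, a group whose completions all receive the same reward contributes zero advantage, which is one reason sparse rewards and collapsed sampling slow learning.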
Scenario Scale
| scenario | turns | instructions | checkpoint stride |
|---|---|---|---|
| small | 60 | 40 | 8 |
| medium | 100 | 120 | 12 |
| large | 180 | 300 | 20 |
Instruction progress and violation signals are tracked in environment metadata.
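For a quick sense of how the instruction load grows with scenario size, the table can be encoded directly. The `ScenarioScale` class is a hypothetical container for this example, and the numbers are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioScale:
    """Hypothetical container for one row of the scenario table."""
    turns: int
    instructions: int
    checkpoint_stride: int

    @property
    def instructions_per_turn(self) -> float:
        return self.instructions / self.turns


SCENARIOS = {
    "small": ScenarioScale(turns=60, instructions=40, checkpoint_stride=8),
    "medium": ScenarioScale(turns=100, instructions=120, checkpoint_stride=12),
    "large": ScenarioScale(turns=180, instructions=300, checkpoint_stride=20),
}

for name, scale in SCENARIOS.items():
    print(f"{name}: {scale.instructions_per_turn:.2f} instructions per turn")
```

The ratio rises from about 0.67 in the small scenario to about 1.67 in the large one, so larger scenarios are both longer and denser in instructions.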
Results Summary
Across training runs, Blue policies generally move from imitation-only behavior (SFT) to stronger environment-aligned behavior after GRPO. In league mode, the same SFT and GRPO steps appear inside each round, while PFSP / PSRO / mix reshapes which opponents or archetypes the learner faces; together this yields distinct multi-round learning dynamics and robustness to varied Red behavior.
Common result artifacts produced by training include:
- consolidated training curves,
- step-by-step optimization history,
- metrics logs,
- per-sample reward traces,
- per-step visualization snapshots,
- and, for league experiments, combined multi-round trend and meta-state reports.
Why It Matters
- Security operations relevance: models multi-step defense decisions closer to real incident response.
- Research relevance: provides a reproducible adversarial benchmark for instruction-following under uncertainty.
- Evaluation relevance: combines environment dynamics, tool-structured actions, and measurable outcomes.
Abbreviations
| Short form | Full form |
|---|---|
| SFT | Supervised Fine-Tuning |
| GRPO | Group Relative Policy Optimization |
| TRL | Transformer Reinforcement Learning |
| LoRA | Low-Rank Adaptation |
| PFSP | Prioritized Fictitious Self-Play |
| PSRO | Policy-Space Response Oracles |
| POSG | Partially Observable Stochastic Game |
| POMDP | Partially Observable Markov Decision Process |
| MTTD | Mean Time To Detect |
| MTTR | Mean Time To Repair |
Project Structure (high level)
cyber_selfplay/
├── cyber_selfplay_env/ # environment core, simulator, rubrics, metrics
├── server/ # OpenEnv API server
├── train/
│   ├── kaggle_grpo.py
│   ├── kaggle_grpo_league.py
│   ├── pfsp.py
│   └── psro_meta.py
└── openenv.yaml
References
- Vinyals et al., Nature, 2019: AlphaStar / league training
- Lanctot et al., NeurIPS 2017: PSRO
- Hu et al., ACM Transactions on Privacy and Security (TOPS), 2021: cyber defense POMDP
- TTCP CAGE-2: defender POMDP framing
- Hugging Face TRL documentation (GRPOTrainer)




