--- title: CyberSelfPlay (Cyber POSG) emoji: 🛡️ colorFrom: blue colorTo: red sdk: docker app_port: 7870 pinned: true --- # CyberSelfPlay: Autonomous Red-vs-Blue Cyber Defense Environment **Training Script Link:** [League (PFSP + PSRO) — Colab (mixed)](https://colab.research.google.com/drive/192y6Xf6uYjW0Z0yffBaKjtfVJGCT4b4S?usp=sharing) **An interactive Game based on Environment:** [Game](https://openenv-ui.vercel.app/) CyberSelfPlay is an OpenEnv-compatible reinforcement learning environment for cyber defense. The setting is a partially observable, stochastic Red-vs-Blue contest where Blue must execute enterprise recovery playbooks while Red applies adversarial pressure. ## Environment on Hugging Face Space - **Live Space (hub):** [CyberSelfPlay on Hugging Face](https://huggingface.co/spaces/HarshitShri026) - **Running app / API base:** [https://harshitshri026-cyberselfplay-env.hf.space](https://harshitshri026-cyberselfplay-env.hf.space/) - **Interactive API (Swagger):** [https://harshitshri026-cyberselfplay-env.hf.space/docs](https://harshitshri026-cyberselfplay-env.hf.space/docs) - **ReDoc:** [https://harshitshri026-cyberselfplay-env.hf.space/redoc](https://harshitshri026-cyberselfplay-env.hf.space/redoc) - **Narrative, Colab context, and results figures:** [Blogs](Blog.md) --- ## Problem and Capability Gap Most agent benchmarks are short and single-agent. Cyber defense in practice is multi-step, partially observable, adversarial, and stochastic. CyberSelfPlay targets that gap by coupling multi-step mission execution with attacker-defender interaction and structured tool actions. **Connection to long-horizon and self-play themes:** the setting stresses **(super) long-horizon planning and instruction following**—episodes with many steps, many playbook instructions, and security rewards that are often sparse or delayed, so the agent must track state, recover from mis-steps, and keep coherent plans across long runs. It also supports **self-improvement through interaction**: the non-league recipes use **SFT** followed by **GRPO**; **league** methods combine the same **SFT** initialization and per-round **mini-GRPO** steps with **PFSP / PSRO / mix** updates over Red archetypes or pools. Opponents and rounds change, so the LLM policy is not tuned on a static task set but on an evolving curriculum over the same family of tasks. --- ## Environment Design Two **agents** interact in a shared hidden state: **Blue** (defender) is the trainable side in most recipes; **Red** (attacker) can be scripted, drawn from a pool, or used as a **league** opponent. Time advances in **discrete steps**: each `step` takes one `CyberAction` and returns a **player-specific** `CyberObservation` (then you alternate or follow your rollout script). The OpenEnv **server** exposes the same `CyberAction` / `CyberObservation` contract over HTTP for remote rollouts and demos. ### What the agent observes `CyberObservation` is built in `cyber_selfplay_env/models.py` and returned by `CyberSelfPlayEnvironment.reset` and `step`. It includes: - **`public_state`**: a **partial** dict from `CyberSimulator.visible_state(actor)`. For Blue, expect fields such as `time_step`, a window of `detections`, `business_impact`, a coarse `known_incident_count`, and `instruction_progress` with counts of **completed**, **violated**, and **total** mission instructions. Red’s public view is different (e.g., limited `known_targets`, `high_value_guess_count`, `detection_pressure`) so the game is a true two-sided **POSG** with distinct observation channels. - **`telemetry`**: a short list of event-like records (e.g., recent detections for Blue; a compact risk string for Red). - **`incident_summary`**: episode-level fields including `terminated`, `winner` when set, exfil and **time_step** (exact keys evolve with simulator state but stay consistent in the client). - **`reward`**: scalar reward from the environment for the last transition. - **`done`**: whether the episode has ended. - **`metadata`**: on normal steps, includes **`reward_components`**, raw **`events`**, **`posg_metrics`** (aggregates like exfil and instruction completion rates), and **`curriculum`** block (scenario name, rolling Blue win rate, episode index). The initial reset also carries scenario/actor hints. Invalid tool calls return **`metadata["error"]`** without terminating the run. ### What the agent does Policies act through **`CyberAction`**: `actor` (`"red"` | `"blue"`), `tool_name`, optional `target` (host/asset id), `params` (tool-specific `dict`, e.g. for `execute_instruction`), and optional `rationale`. The environment validates the tool name against the allowed set for that side, then the **CyberSimulator** applies the effect. **Blue tools** (defense and playbook): e.g. `query_siem`, `triage_alerts`, `isolate_host`, `disable_account`, `rotate_secrets`, `deploy_patch`, `harden_policy`, `restore_backup`, `run_forensics`, `publish_ioc_blocklist`, `execute_instruction`, `checkpoint_plan`, `reconcile_state`. **Red tools** (attack chain): e.g. `recon_network`, `enumerate_services`, `attempt_exploit`, `dump_credentials`, `pivot_host`, `establish_persistence`, `prepare_exfiltration`, `execute_exfiltration`, `cover_tracks`, `sabotage_recovery_plan`. (Authoritative sets live in `cyber_selfplay_env/tools_blue.py` and `tools_red.py`.) ### What the agent is rewarded for Rewards combine security outcomes (detection, containment, recovery, exfiltration pressure) and mission outcomes (instruction progress, checkpoints, violations). ### Formal game model CyberSelfPlay is modeled as a two-player partially observable stochastic game (POSG): $$ \mathcal{G}=\langle \mathcal{S},\mathcal{A}_R,\mathcal{A}_B,\mathcal{O}_R,\mathcal{O}_B,T,Z_R,Z_B,r_R,r_B,\gamma \rangle $$ where: $$ \begin{align*} \mathcal{S} &: \text{hidden environment state space} \\ \mathcal{A}_R, \mathcal{A}_B &: \text{Red and Blue action spaces} \\ \mathcal{O}_R, \mathcal{O}_B &: \text{Red and Blue observation spaces} \\ T &: \text{state-transition kernel} \\ Z_R, Z_B &: \text{observation emission models for each player} \\ r_R, r_B &: \text{Red and Blue reward functions} \\ \gamma &: \text{discount factor} \end{align*} $$ with objective: $$ J_i(\pi_i,\pi_{-i})=\mathbb{E}\left[\sum_{t=0}^{H}\gamma^t r_i\left(s_t,a_t^{R},a_t^{B}\right)\right],\quad i\in\{R,B\} $$ with: $$ \begin{aligned} \pi_i &: \text{the policy of player } i \text{ and } \pi_{-i} \text{ the opponent policy} \\ H &: \text{the episode horizon (maximum time steps)} \\ s_t &: \text{the state at time } t \\ a_t^{R}, a_t^{B} &: \text{the Red and Blue actions at time } t \\ r_i(\cdot) &: \text{the reward received by player } i \end{aligned} $$ and near-zero-sum coupling: $$ r_B=-r_R-\lambda C_{\mathrm{collateral}}. $$ --- ## Reward Structure **Red reward** $$ \begin{aligned} r_R &= w_1 \mathbb{1}_{\mathrm{foothold}} + w_2 \mathbb{1}_{\mathrm{priv}} + w_3 \mathbb{1}_{\mathrm{lateral}} + w_4 \mathbb{1}_{\mathrm{exfil}} \\ &\quad - w_5 \mathbb{1}_{\mathrm{detect}} + w_6 \mathbb{1}_{\mathrm{plan\_sabotage}} - \eta_R \end{aligned} $$ **Blue reward** $$ \begin{aligned} r_B &= v_1 \mathbb{1}_{\mathrm{detect}} + v_2 \mathbb{1}_{\mathrm{contain}} + v_3 \mathbb{1}_{\mathrm{recover}} - v_4 \mathbb{1}_{\mathrm{exfil}} \\ &\quad + v_5 \mathbb{1}_{\mathrm{instr\_progress}} + v_6 \mathbb{1}_{\mathrm{checkpoint}} - v_7 \mathbb{1}_{\mathrm{instr\_violation}} \\ &\quad + v_8 \rho_{\mathrm{inst}} - \eta_B \end{aligned} $$ The reward rubric is implemented directly in the environment’s scoring logic. --- ## Environment Architecture ## Training Flow --- ## 🚀 Training Approaches in This Project This project explores multiple training strategies for learning robust Blue policies in the CyberSelfPlay environment. We experiment across **SFT + GRPO baselines**, **reward smoothing**, **diversity shaping**, and **league-based RL** where each round still relies on **SFT**-style warm-start and **GRPO** (mini-GRPO per round) in addition to **PFSP / PSRO** opponent scheduling. --- ### 📊 Overview of Training Methods | Method | Description | Colab | Metrics / Curves | |--------|------------|-------|------------------| | **🔹 GRPO (Single-Policy RL)** |||| | **SFT → GRPO (Vanilla)** | Baseline using only environment reward | [Open](https://colab.research.google.com/drive/1K5771KT0-2lyU6eNghqQEStBS4OSF7D7?usp=sharing) | | | **SFT → GRPO (Anti-Collapse)** | Adds diversity penalty to avoid mode collapse | [Open](https://colab.research.google.com/drive/1HivyWte1q-sugE04XsyMi1U_RY1oGkJ8?usp=sharing) | | | **🔹 League (Multi-Policy RL)** |||| | **League (PFSP)** | SFT, then per-round **mini-GRPO**; **PFSP** weights which Red-style opponent to sample | [Open](https://colab.research.google.com/drive/1g2QCBqdvo7QwRC7dJaV8QdO7RvTPGyY1?usp=sharing) | | | **League (PSRO)** | SFT, then per-round **mini-GRPO**; **PSRO**-style meta-updates on the opponent population | [Open](https://colab.research.google.com/drive/1O6IoE-_UloAeDXKve2ZA1W4OajychglP?usp=sharing) | | | **League (PFSP + PSRO)** | SFT, then per-round **mini-GRPO**; **PFSP** sampling and **PSRO** replicator updates used together | [Open](https://colab.research.google.com/drive/192y6Xf6uYjW0Z0yffBaKjtfVJGCT4b4S?usp=sharing) | | --- ## 📐 Mathematical Formulation ### 1. GRPO (Vanilla) $$ \mathcal{L}*{\text{GRPO}} = \mathbb{E}\left[\log \pi*\theta(a_i \mid s),(r_i - \bar{r})\right] $$ $$ \bar{r} = \frac{1}{N}\sum_{i=1}^{N} r_i $$ --- ### 2. GRPO + Regularization (Anti-Collapse) $$ r_i' = r_i - \lambda ,\max\big(0,; p(a_i) - \tau\big) $$ where: $$ \begin{aligned} p(a_i) &= \text{frequency of action in batch} \ \tau &= \text{threshold} \end{aligned} $$ --- ### 3. PFSP (Prioritized Fictitious Self-Play) $$ p_j \propto f(w_j), \qquad f(w) = w(1 - w) $$ where: $$ \begin{aligned} w_j &= \text{win-rate against opponent } j \end{aligned} $$ --- ### 4. PSRO (Policy-Space Response Oracles) $$ p_i' \propto p_i \left(1 + \eta (u_i - \bar{u}) \right) $$ $$ \bar{u} = \sum_i p_i u_i $$ where: $$ \begin{aligned} u_i &= \text{utility of policy } i \\ \eta &= \text{learning rate} \end{aligned} $$ --- ### 5. PFSP + PSRO (Combined) $$ p_j \propto f(w_j), \qquad f(w) = w(1 - w) $$ $$ p_i' \propto p_i \left(1 + \eta (u_i - \bar{u}) \right) $$ Combines opponent sampling (PFSP) with meta-policy updates (PSRO). ## Core Optimization Math ### SFT (Supervised Fine-Tuning) Token-level cross-entropy (negative log-likelihood) on expert trajectories. ### GRPO (Group Relative Policy Optimization) For prompt $x$, sample a group of completions: $$ \{y^{(1)},\ldots,y^{(G)}\} \sim \pi_\theta(\cdot\mid x) $$ Score each completion with reward $R^{(j)}$, compute group-relative advantages, and update policy parameters with optional KL regularization toward a reference policy $\pi_{\text{ref}}$. --- ## Scenario Scale | scenario | turns | instructions | checkpoint stride | | --- | ---: | ---: | ---: | | small | 60 | 40 | 8 | | medium | 100 | 120 | 12 | | large | 180 | 300 | 20 | Instruction progress and violation signals are tracked in environment metadata. --- ## Results Summary Across training runs, Blue policies generally move from imitation-only behavior (SFT) to stronger environment-aligned behavior after GRPO. In league mode, the same **SFT** and **GRPO** steps appear inside each **round**, while **PFSP / PSRO / mix** reshapes *which* opponents or archetypes the learner faces; together this yields distinct multi-round learning dynamics and robustness to varied Red behavior. Common result artifacts produced by training include: - consolidated training curves, - step-by-step optimization history, - metrics logs, - per-sample reward traces, - per-step visualization snapshots, - and, for league experiments, combined multi-round trend and meta-state reports. --- ## Why It Matters - **Security operations relevance:** models multi-step defense decisions closer to real incident response. - **Research relevance:** provides a reproducible adversarial benchmark for instruction-following under uncertainty. - **Evaluation relevance:** combines environment dynamics, tool-structured actions, and measurable outcomes. --- ## Abbreviations | Short form | Full form | | --- | --- | | SFT | Supervised Fine-Tuning | | GRPO | Group Relative Policy Optimization | | TRL | Transformers Reinforcement Learning | | LoRA | Low-Rank Adaptation | | PFSP | Prioritized Fictitious Self-Play | | PSRO | Policy-Space Response Orbit | | POSG | Partially Observable Stochastic Game | | POMDP | Partially Observable Markov Decision Process | | MTTD | Mean Time To Detect | | MTTR | Mean Time To Repair | --- ## Project Structure (high level) ```text cyber_selfplay/ ├── cyber_selfplay_env/ # environment core, simulator, rubrics, metrics ├── server/ # OpenEnv API server ├── train/ │ ├── kaggle_grpo.py │ ├── kaggle_grpo_league.py │ ├── pfsp.py │ └── psro_meta.py └── openenv.yaml ``` --- ## References - Vinyals et al., *Nature* 2019 — AlphaStar / league training - Lanctot et al., *NeurIPS* 2017 — PSRO - Hu et al., *ACM Transactions on Privacy and Security* (TOPS), 2021 — cyber defense POMDP - TTCP CAGE-2 — defender POMDP framing - Hugging Face TRL documentation (`GRPOTrainer`)