cyberselfplay-env / README.md
HarshitShri026's picture
Update README.md
04242b9 verified
metadata
title: CyberSelfPlay (Cyber POSG)
emoji: πŸ›‘οΈ
colorFrom: blue
colorTo: red
sdk: docker
app_port: 7870
pinned: true

CyberSelfPlay: Autonomous Red-vs-Blue Cyber Defense Environment

Training Script Link: League (PFSP + PSRO) β€” Colab (mixed)

An interactive Game based on Environment: Game

CyberSelfPlay is an OpenEnv-compatible reinforcement learning environment for cyber defense. The setting is a partially observable, stochastic Red-vs-Blue contest where Blue must execute enterprise recovery playbooks while Red applies adversarial pressure.

Environment on Hugging Face Space


Problem and Capability Gap

Most agent benchmarks are short and single-agent. Cyber defense in practice is multi-step, partially observable, adversarial, and stochastic. CyberSelfPlay targets that gap by coupling multi-step mission execution with attacker-defender interaction and structured tool actions.

Connection to long-horizon and self-play themes: the setting stresses (super) long-horizon planning and instruction followingβ€”episodes with many steps, many playbook instructions, and security rewards that are often sparse or delayed, so the agent must track state, recover from mis-steps, and keep coherent plans across long runs. It also supports self-improvement through interaction: the non-league recipes use SFT followed by GRPO; league methods combine the same SFT initialization and per-round mini-GRPO steps with PFSP / PSRO / mix updates over Red archetypes or pools. Opponents and rounds change, so the LLM policy is not tuned on a static task set but on an evolving curriculum over the same family of tasks.


Environment Design

Two agents interact in a shared hidden state: Blue (defender) is the trainable side in most recipes; Red (attacker) can be scripted, drawn from a pool, or used as a league opponent. Time advances in discrete steps: each step takes one CyberAction and returns a player-specific CyberObservation (then you alternate or follow your rollout script). The OpenEnv server exposes the same CyberAction / CyberObservation contract over HTTP for remote rollouts and demos.

What the agent observes

CyberObservation is built in cyber_selfplay_env/models.py and returned by CyberSelfPlayEnvironment.reset and step. It includes:

  • public_state: a partial dict from CyberSimulator.visible_state(actor). For Blue, expect fields such as time_step, a window of detections, business_impact, a coarse known_incident_count, and instruction_progress with counts of completed, violated, and total mission instructions. Red’s public view is different (e.g., limited known_targets, high_value_guess_count, detection_pressure) so the game is a true two-sided POSG with distinct observation channels.
  • telemetry: a short list of event-like records (e.g., recent detections for Blue; a compact risk string for Red).
  • incident_summary: episode-level fields including terminated, winner when set, exfil and time_step (exact keys evolve with simulator state but stay consistent in the client).
  • reward: scalar reward from the environment for the last transition.
  • done: whether the episode has ended.
  • metadata: on normal steps, includes reward_components, raw events, posg_metrics (aggregates like exfil and instruction completion rates), and curriculum block (scenario name, rolling Blue win rate, episode index). The initial reset also carries scenario/actor hints. Invalid tool calls return metadata["error"] without terminating the run.

What the agent does

Policies act through CyberAction: actor ("red" | "blue"), tool_name, optional target (host/asset id), params (tool-specific dict, e.g. for execute_instruction), and optional rationale. The environment validates the tool name against the allowed set for that side, then the CyberSimulator applies the effect.

Blue tools (defense and playbook): e.g. query_siem, triage_alerts, isolate_host, disable_account, rotate_secrets, deploy_patch, harden_policy, restore_backup, run_forensics, publish_ioc_blocklist, execute_instruction, checkpoint_plan, reconcile_state.
Red tools (attack chain): e.g. recon_network, enumerate_services, attempt_exploit, dump_credentials, pivot_host, establish_persistence, prepare_exfiltration, execute_exfiltration, cover_tracks, sabotage_recovery_plan.
(Authoritative sets live in cyber_selfplay_env/tools_blue.py and tools_red.py.)

What the agent is rewarded for

Rewards combine security outcomes (detection, containment, recovery, exfiltration pressure) and mission outcomes (instruction progress, checkpoints, violations).

Formal game model

CyberSelfPlay is modeled as a two-player partially observable stochastic game (POSG):

G=⟨S,AR,AB,OR,OB,T,ZR,ZB,rR,rB,γ⟩ \mathcal{G}=\langle \mathcal{S},\mathcal{A}_R,\mathcal{A}_B,\mathcal{O}_R,\mathcal{O}_B,T,Z_R,Z_B,r_R,r_B,\gamma \rangle

where:

S:hidden environment state spaceAR,AB:Red and Blue action spacesOR,OB:Red and Blue observation spacesT:state-transition kernelZR,ZB:observation emission models for each playerrR,rB:Red and Blue reward functionsΞ³:discount factor \begin{align*} \mathcal{S} &: \text{hidden environment state space} \\ \mathcal{A}_R, \mathcal{A}_B &: \text{Red and Blue action spaces} \\ \mathcal{O}_R, \mathcal{O}_B &: \text{Red and Blue observation spaces} \\ T &: \text{state-transition kernel} \\ Z_R, Z_B &: \text{observation emission models for each player} \\ r_R, r_B &: \text{Red and Blue reward functions} \\ \gamma &: \text{discount factor} \end{align*}

with objective:

Ji(Ο€i,Ο€βˆ’i)=E[βˆ‘t=0HΞ³tri(st,atR,atB)],i∈{R,B} J_i(\pi_i,\pi_{-i})=\mathbb{E}\left[\sum_{t=0}^{H}\gamma^t r_i\left(s_t,a_t^{R},a_t^{B}\right)\right],\quad i\in\{R,B\}

with:

Ο€i:the policy of player i and Ο€βˆ’i the opponent policyH:the episode horizon (maximum time steps)st:the state at time tatR,atB:the Red and Blue actions at time tri(β‹…):the reward received by player i \begin{aligned} \pi_i &: \text{the policy of player } i \text{ and } \pi_{-i} \text{ the opponent policy} \\ H &: \text{the episode horizon (maximum time steps)} \\ s_t &: \text{the state at time } t \\ a_t^{R}, a_t^{B} &: \text{the Red and Blue actions at time } t \\ r_i(\cdot) &: \text{the reward received by player } i \end{aligned}

and near-zero-sum coupling:

rB=βˆ’rRβˆ’Ξ»Ccollateral. r_B=-r_R-\lambda C_{\mathrm{collateral}}.


Reward Structure

Red reward

rR=w11foothold+w21priv+w31lateral+w41exfilβˆ’w51detect+w61plan_sabotageβˆ’Ξ·R \begin{aligned} r_R &= w_1 \mathbb{1}_{\mathrm{foothold}} + w_2 \mathbb{1}_{\mathrm{priv}} + w_3 \mathbb{1}_{\mathrm{lateral}} + w_4 \mathbb{1}_{\mathrm{exfil}} \\ &\quad - w_5 \mathbb{1}_{\mathrm{detect}} + w_6 \mathbb{1}_{\mathrm{plan\_sabotage}} - \eta_R \end{aligned}

Blue reward

rB=v11detect+v21contain+v31recoverβˆ’v41exfil+v51instr_progress+v61checkpointβˆ’v71instr_violation+v8ρinstβˆ’Ξ·B \begin{aligned} r_B &= v_1 \mathbb{1}_{\mathrm{detect}} + v_2 \mathbb{1}_{\mathrm{contain}} + v_3 \mathbb{1}_{\mathrm{recover}} - v_4 \mathbb{1}_{\mathrm{exfil}} \\ &\quad + v_5 \mathbb{1}_{\mathrm{instr\_progress}} + v_6 \mathbb{1}_{\mathrm{checkpoint}} - v_7 \mathbb{1}_{\mathrm{instr\_violation}} \\ &\quad + v_8 \rho_{\mathrm{inst}} - \eta_B \end{aligned}

The reward rubric is implemented directly in the environment’s scoring logic.


Environment Architecture

Training Flow


πŸš€ Training Approaches in This Project

This project explores multiple training strategies for learning robust Blue policies in the CyberSelfPlay environment.
We experiment across SFT + GRPO baselines, reward smoothing, diversity shaping, and league-based RL where each round still relies on SFT-style warm-start and GRPO (mini-GRPO per round) in addition to PFSP / PSRO opponent scheduling.


πŸ“Š Overview of Training Methods

Method Description Colab Metrics / Curves
πŸ”Ή GRPO (Single-Policy RL)
SFT β†’ GRPO (Vanilla) Baseline using only environment reward Open
SFT β†’ GRPO (Anti-Collapse) Adds diversity penalty to avoid mode collapse Open
πŸ”Ή League (Multi-Policy RL)
League (PFSP) SFT, then per-round mini-GRPO; PFSP weights which Red-style opponent to sample Open
League (PSRO) SFT, then per-round mini-GRPO; PSRO-style meta-updates on the opponent population Open
League (PFSP + PSRO) SFT, then per-round mini-GRPO; PFSP sampling and PSRO replicator updates used together Open

πŸ“ Mathematical Formulation

1. GRPO (Vanilla)

Lβˆ—GRPO=E[logβ‘Ο€βˆ—ΞΈ(ai∣s),(riβˆ’rΛ‰)] \mathcal{L}*{\text{GRPO}} = \mathbb{E}\left[\log \pi*\theta(a_i \mid s),(r_i - \bar{r})\right]

rΛ‰=1Nβˆ‘i=1Nri \bar{r} = \frac{1}{N}\sum_{i=1}^{N} r_i


2. GRPO + Regularization (Anti-Collapse)

riβ€²=riβˆ’Ξ»,max⁑(0,;p(ai)βˆ’Ο„) r_i' = r_i - \lambda ,\max\big(0,; p(a_i) - \tau\big)

where: p(ai)=frequency of action in batch Ο„=threshold \begin{aligned} p(a_i) &= \text{frequency of action in batch} \ \tau &= \text{threshold} \end{aligned}


3. PFSP (Prioritized Fictitious Self-Play)

pj∝f(wj),f(w)=w(1βˆ’w) p_j \propto f(w_j), \qquad f(w) = w(1 - w)

where: wj=win-rate against opponent j \begin{aligned} w_j &= \text{win-rate against opponent } j \end{aligned}


4. PSRO (Policy-Space Response Oracles)

piβ€²βˆpi(1+Ξ·(uiβˆ’uΛ‰)) p_i' \propto p_i \left(1 + \eta (u_i - \bar{u}) \right)

uΛ‰=βˆ‘ipiui \bar{u} = \sum_i p_i u_i

where: ui=utility of policy iΞ·=learning rate \begin{aligned} u_i &= \text{utility of policy } i \\ \eta &= \text{learning rate} \end{aligned}


5. PFSP + PSRO (Combined)

pj∝f(wj),f(w)=w(1βˆ’w) p_j \propto f(w_j), \qquad f(w) = w(1 - w)

piβ€²βˆpi(1+Ξ·(uiβˆ’uΛ‰)) p_i' \propto p_i \left(1 + \eta (u_i - \bar{u}) \right)

Combines opponent sampling (PFSP) with meta-policy updates (PSRO).

Core Optimization Math

SFT (Supervised Fine-Tuning)

Token-level cross-entropy (negative log-likelihood) on expert trajectories.

GRPO (Group Relative Policy Optimization)

For prompt $x$, sample a group of completions:

{y(1),…,y(G)}βˆΌΟ€ΞΈ(β‹…βˆ£x) \{y^{(1)},\ldots,y^{(G)}\} \sim \pi_\theta(\cdot\mid x)

Score each completion with reward $R^{(j)}$, compute group-relative advantages, and update policy parameters with optional KL regularization toward a reference policy $\pi_{\text{ref}}$.


Scenario Scale

scenario turns instructions checkpoint stride
small 60 40 8
medium 100 120 12
large 180 300 20

Instruction progress and violation signals are tracked in environment metadata.


Results Summary

Across training runs, Blue policies generally move from imitation-only behavior (SFT) to stronger environment-aligned behavior after GRPO. In league mode, the same SFT and GRPO steps appear inside each round, while PFSP / PSRO / mix reshapes which opponents or archetypes the learner faces; together this yields distinct multi-round learning dynamics and robustness to varied Red behavior.

Common result artifacts produced by training include:

  • consolidated training curves,
  • step-by-step optimization history,
  • metrics logs,
  • per-sample reward traces,
  • per-step visualization snapshots,
  • and, for league experiments, combined multi-round trend and meta-state reports.

Why It Matters

  • Security operations relevance: models multi-step defense decisions closer to real incident response.
  • Research relevance: provides a reproducible adversarial benchmark for instruction-following under uncertainty.
  • Evaluation relevance: combines environment dynamics, tool-structured actions, and measurable outcomes.

Abbreviations

Short form Full form
SFT Supervised Fine-Tuning
GRPO Group Relative Policy Optimization
TRL Transformers Reinforcement Learning
LoRA Low-Rank Adaptation
PFSP Prioritized Fictitious Self-Play
PSRO Policy-Space Response Orbit
POSG Partially Observable Stochastic Game
POMDP Partially Observable Markov Decision Process
MTTD Mean Time To Detect
MTTR Mean Time To Repair

Project Structure (high level)

cyber_selfplay/
β”œβ”€β”€ cyber_selfplay_env/       # environment core, simulator, rubrics, metrics
β”œβ”€β”€ server/                   # OpenEnv API server
β”œβ”€β”€ train/
β”‚   β”œβ”€β”€ kaggle_grpo.py
β”‚   β”œβ”€β”€ kaggle_grpo_league.py
β”‚   β”œβ”€β”€ pfsp.py
β”‚   └── psro_meta.py
└── openenv.yaml

References

  • Vinyals et al., Nature 2019 β€” AlphaStar / league training
  • Lanctot et al., NeurIPS 2017 β€” PSRO
  • Hu et al., ACM Transactions on Privacy and Security (TOPS), 2021 β€” cyber defense POMDP
  • TTCP CAGE-2 β€” defender POMDP framing
  • Hugging Face TRL documentation (GRPOTrainer)