OpenRA-RL: An Open Platform for AI Agents in Real-Time Strategy Games
- Environment: modified OpenRA engine + Python wrapper, 25 Hz game loop, 9-channel spatial obs, 21 action types
- LLM-friendly: 50 actions exposed as MCP tools; works with Claude Desktop, OpenRouter, Ollama, LM Studio out of the box
- Async by design: bounded DropOldest channels mean a 2-second-per-step LLM and a 40 ms scripted bot drive the same engine without changes
- Scales: 64 concurrent sessions in one .NET process — 40× faster reset and 7× less RAM than the per-process v1
- OpenEnv-native: drop the env into TRL/torchforge/Unsloth and run GRPO without environment-specific adapters
- Honest baseline: Qwen3 32B against the Beginner AI scores 0.58–0.80 on economy and 0.0 on combat across 5 episodes — exactly the kind of headroom an RL testbed needs
Resources — 📄 PDF paper · 💻 Environment (GitHub) · 🧑‍💻 Training repo · 🎬 Demo video · 🤗 OpenEnv Hub
🎮 Click above to watch the AI agent play Red Alert
Why another RTS platform?
Real-time strategy games have driven landmark AI achievements — DeepMind's AlphaStar for StarCraft II, OpenAI Five for Dota 2, earlier work on StarCraft. These results were impressive but built on bespoke neural-network architectures, imitation learning from human replays, and distributed RL across thousands of TPUs. The infrastructure does not generalize.
Meanwhile, LLM agents have become a credible general-purpose paradigm — pretrained world knowledge, natural-language reasoning, high-level semantic actions. Web navigation, code generation, tool use. The natural next question: can a frontier LLM, without any RTS-specific training, hold its own in a real-time strategy game?
The honest answer is: nobody knows yet, because no existing RTS platform actually supports LLM agents. Current platforms assume agents that act at millisecond timescales with low-level action spaces. LLM agents need the opposite — high-level interfaces, asynchronous interaction, and tolerance for variable inference latency that swings from 40 ms to multiple seconds. Trying to bolt an LLM onto SC2LE or PySC2 is possible but ad-hoc, and the resulting baselines are not comparable across papers.
OpenRA-RL is our attempt to close that gap. We picked the classic Westwood RTS Red Alert (open-sourced by the OpenRA project) because it has rich strategic depth, a clean codebase we could modify, and a built-in AI ladder for opponents. The result is a platform that lets you point a Qwen3, Claude, or scripted Python bot at the same environment with no scaffolding changes.
Architecture at a glance
OpenRA-RL has three layers: a modified OpenRA engine in C# that ticks the game at ~25 Hz, a gRPC bridge that streams observations and accepts commands, and a Python wrapper that exposes a Gymnasium-style reset / step / close interface via FastAPI. On top of that, an MCP server exposes 50 game actions as tools, so any MCP-compatible LLM client can drive a game.
Three layers: the LLM agent talks to an MCP server, which routes to a Python backend, which talks gRPC to the C# game engine. The same Python env is also a plain OpenEnv environment, so a TRL trainer can drive it without going through MCP at all.
The point of the layering is that agent computation is fully decoupled from game execution. A scripted bot at 40 ms/step and an LLM at 2 s/step both interact with the same 25 Hz engine without disrupting game flow.
The async problem (and how we solved it)
A 25 Hz game engine produces an observation every ~40 ms. A single LLM step can take 2 seconds or more. A naïve "step the env, wait for the agent, step the env" loop falls over: either the game stalls waiting for the agent (no longer real-time), or the agent gets buried under thousands of stale observations.
We used .NET System.Threading.Channels with bounded, non-blocking semantics:
Observation channel (game → agent) — BoundedChannel<GameObservation> with capacity 1 and DropOldest. Every tick the engine writes the latest world state; the channel silently overwrites whatever the agent hasn't read yet. The agent therefore always reads the most recent state, never a stale queue. A fast agent (~40 ms) loses nothing; a slow agent (~2 s) skips ~50 intermediate ticks but still acts on fresh data.
Action channel (agent → game) — BoundedChannel<AgentAction> with capacity 16. Agents often emit 5–10 commands in a single batch (train a unit, move a squad, set a rally point); 16 is enough buffer to never block the agent. The asymmetry matters: dropping a stale observation is harmless (a fresher one is coming), but dropping an action means the agent's intent silently disappears.
Non-blocking guarantee — Both channels use TryWrite. The game thread never waits for the agent. If no action has arrived by tick time, the engine proceeds with a no-op. Game progression is fully independent of agent latency, which is what makes it fair to benchmark a 40 ms scripted bot against a 3 s LLM on the same map.
Observation channel = capacity 1, DropOldest (always-fresh). Action channel = capacity 16 (buffers command batches). The two timing examples on the bottom row show how a fast and slow agent both run against the same 25 Hz engine.
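For readers who want the shape of the pattern without reading C#, here is a minimal Python sketch of the same two-channel semantics (asyncio-based; the class names are illustrative, and the real implementation uses .NET System.Threading.Channels):

```python
import asyncio

class LatestOnlyChannel:
    """Capacity-1, drop-oldest mailbox: the reader always sees the freshest item."""
    def __init__(self):
        self._q = asyncio.Queue(maxsize=1)

    def try_write(self, obs) -> bool:
        if self._q.full():                # overwrite semantics: evict the unread
            self._q.get_nowait()          # observation instead of blocking
        self._q.put_nowait(obs)
        return True                       # the game tick never waits on the agent

    async def read(self):
        return await self._q.get()        # always the most recent world state

class BoundedActionChannel:
    """Capacity-16 buffer: absorbs command batches, never silently drops intent."""
    def __init__(self, capacity: int = 16):
        self._q = asyncio.Queue(maxsize=capacity)

    def try_write(self, action) -> bool:
        try:
            self._q.put_nowait(action)
            return True
        except asyncio.QueueFull:
            return False                  # surfaced to the caller, not dropped

    def try_read(self):
        try:
            return self._q.get_nowait()
        except asyncio.QueueEmpty:
            return None                   # engine ticks on with a no-op
```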
64 games in one .NET process
Training and large-scale evaluation need many concurrent games. Our v1 design spawned a separate dotnet OpenRA.dll process per game. At 64 sessions: ~40 GB RAM, 5–15 seconds per reset. Unworkable.
In v2 we moved to a single .NET process hosting up to 64 sessions. The trick is that ModData (unit stats, building attributes, tech trees, map rules) is immutable after init — load it once, share it lock-free across sessions. That alone reclaims ~35 GB. Each session keeps its own World, OrderManager, and BotBridge, isolated from the others.
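In Python terms, the pattern looks like the sketch below. This is a minimal analogue, not the platform's code: the loader name and its contents are illustrative, and the real shared object is the C# ModData.

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def load_mod_data() -> dict:
    """Loaded once per process and treated as immutable thereafter,
    so every session reads it lock-free. Contents are illustrative."""
    return {"unit_stats": {}, "tech_tree": {}, "map_rules": {}}

class Session:
    """Per-session mutable state (the analogue of World / OrderManager /
    BotBridge); only the immutable ModData is shared across sessions."""
    def __init__(self, session_id: str):
        self.id = session_id
        self.mod_data = load_mod_data()   # shared, read-only
        self.world = {}                   # isolated, mutable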
| Metric | Legacy (v1) | Multi-session (v2) | Improvement |
|---|---|---|---|
| Reset latency | 5–15 s | 256 ms | ~40× |
| RSS (64 sessions) | ~40 GB | ~6 GB | ~7× |
| JIT compilations | 64× | 1× | 64× |
| Active threads | ~200 | ~20 | ~10× |
| Aggregate ticks/sec | ~8 K | ~15 K | ~2× |
One subtle gotcha worth flagging for anyone doing similar work: don't share the .NET ThreadPool between game ticks and gRPC handlers. We did this in an early v2 and saw 0/16 sessions complete — game-tick tasks saturated the pool, gRPC handlers starved, and the platform deadlocked itself. We now run game ticks on a dedicated BlockingCollection<WorkItem> worker pool sized to the CPU count, separate from the gRPC pool. Per-session SemaphoreSlim(1,1) serializes mutations within a world; sessions tick in parallel. When the worker queue fills up, we return gRPC RESOURCE_EXHAUSTED for natural backpressure instead of unbounded queue growth.
v2: one process, shared ModData, dedicated worker pool, per-session semaphore. A single gRPC channel routes by session_id.
v1, for comparison: 64 separate .NET processes, each paying the JIT and ModData tax.
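For anyone replicating the worker-pool design, here is a minimal Python sketch of the fix described above (the real code is C#; TickWorkerPool and its method names are illustrative): a dedicated bounded queue for ticks, a per-session lock, and explicit backpressure when the queue fills.

```python
import os, queue, threading

class TickWorkerPool:
    """Dedicated pool for game ticks, kept separate from the gRPC handler pool
    so neither can starve the other."""
    def __init__(self, capacity: int = 256):
        self._work = queue.Queue(maxsize=capacity)   # bounded: no runaway growth
        self._locks: dict[str, threading.Lock] = {}  # SemaphoreSlim(1,1) analogue
        for _ in range(os.cpu_count() or 4):         # sized to the CPU count
            threading.Thread(target=self._worker, daemon=True).start()

    def submit_tick(self, session_id: str, tick_fn) -> bool:
        try:
            self._work.put_nowait((session_id, tick_fn))
            return True
        except queue.Full:
            return False   # caller maps this to gRPC RESOURCE_EXHAUSTED

    def _worker(self):
        while True:
            session_id, tick_fn = self._work.get()
            lock = self._locks.setdefault(session_id, threading.Lock())
            with lock:     # serialize mutations within one world...
                tick_fn()  # ...while different sessions tick in parallel
            self._work.task_done()
```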
Lifecycle, replays, observability
The environment is an explicit 8-state machine: IDLE → LAUNCHING → LOADING → CONNECTING → STREAMING → PLAYING → GAME_OVER → CLEANUP. On reset() the system walks the chain through process startup and map loading; the Python BridgeClient retries GetState() up to 120 times until the gRPC channel is up. Two explicit error paths (TIMEOUT and CONN_LOST) trigger an immediate abort + cleanup so we never leave resources in a half-broken state. A health-check endpoint independently verifies daemon liveness and gRPC connectivity, and restarts the daemon if either check fails.
Eight states from IDLE to CLEANUP, plus two error transitions and a separate replay-playback path that reuses the same machinery.
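A sketch of the reset-time connection loop, assuming a client exposing the GetState() method the prose mentions; the retry count comes from the text, while the delay and exception type are illustrative:

```python
import time
from enum import Enum, auto

class EnvState(Enum):
    IDLE = auto(); LAUNCHING = auto(); LOADING = auto(); CONNECTING = auto()
    STREAMING = auto(); PLAYING = auto(); GAME_OVER = auto(); CLEANUP = auto()

def wait_for_bridge(client, max_retries: int = 120, delay_s: float = 0.5):
    """Poll GetState() until the gRPC channel is up; on exhaustion, take the
    TIMEOUT error path (abort + cleanup) rather than leave a half-open env."""
    for _ in range(max_retries):
        try:
            return client.GetState()
        except ConnectionError:
            time.sleep(delay_s)
    raise TimeoutError("bridge never became reachable")
```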
Every game is recorded as a deterministic .orarep replay file: orders + random seed, perfectly reproducible tick-by-tick via a ReplayConnection reader. Replays embed the Docker image version that produced them, so playback fidelity survives engine upgrades. You watch them via browser-based noVNC inside the Docker container (openra-rl replay watch) — no local install, no graphics drivers, and it works from a headless cloud box. Replays double as benchmark evidence: when you upload a result to the OpenRA-Bench leaderboard, the .orarep is attached and anyone can re-verify the game.
Why we built it on OpenEnv
OpenRA-RL ships as a first-class OpenEnv environment. OpenEnv is the emerging PyTorch-native standard for RL environment authoring + distribution: a typed reset / step contract, structured observation/action spaces, and a Hugging Face Hub layer for discovery. Authors publish once, trainers consume anywhere, with no environment-specific glue.
Concretely, this means:
- A researcher running GRPO via TRL points the trainer at the OpenRA-RL OpenEnv ID. No adapter code.
- The same is true for torchforge and Unsloth.
- The Hub publish includes the game-server Docker image, the Python wrapper, and a versioned manifest. An agent built against OpenRA-RL v0.4.2 reproduces exactly when a third party pulls v0.4.2 a year later.
Most existing OpenEnv environments target narrow, short-horizon tasks: code execution, single-turn tool use, small-scale games. OpenRA-RL extends the standard into the long-horizon, adversarial, real-time, combinatorial-action regime with variable agent latency. The async decoupling, the multi-session runner, and the deterministic replay format are reusable design patterns for anyone authoring a similarly complex OpenEnv environment.
Demonstration: Qwen3 32B vs the Beginner AI
To exercise every design surface end-to-end, we ran a Qwen3 32B agent served locally via Ollama against the built-in Beginner AI on a 128×128 Allied map. The agent gets structured observations as tool responses and issues actions through the MCP tool set, including a pre-game planning phase and a post-game reflection step whose extracted lessons are injected into the next episode's system prompt.
Five episodes, two timing regimes — Games 1–2 with a 30-minute limit, Games 3–5 with a 5-minute limit — to show the platform supports variable episode lengths in a single experiment.
| Game | Duration | Ticks | Assets | Buildings | Army value | Explored | Tool calls |
|---|---|---|---|---|---|---|---|
| 1† | 30:23 | 1621 | $6,600 | 5 | $2,920 | 3.7% | 62 |
| 2† | 30:15 | 1477 | $4,000 | 3 | $2,340 | 2.7% | 81 |
| 3 | 5:01 | 540 | $2,800 | 3 | $640 | 2.7% | 18 |
| 4 | 5:19 | 509 | $2,300 | 2 | $540 | 2.2% | 19 |
| 5 | 5:17 | 621 | $2,800 | 3 | $740 | 2.7% | 21 |
† 30-minute limit. Games 3–5 use 5 minutes. All five episodes ended in a draw at the time limit with zero combat engagement. The agent successfully bootstrapped an economy in every game but never produced an offensive force.
That zero-combat result is the actual interesting finding, and a scalar win/loss metric would have flattened it. Look at the multi-dimensional reward instead:
Left: per-dimension scores across all 5 games. Right: Game 1's skill profile as a radar plot. The agent registers non-trivial scores on economy, infrastructure, and tempo but zero on combat and disruption — a precise failure mode you can target for reward shaping or curriculum design.
The build-order timelines confirm what the reward vector says: Power Plant, Barracks, sometimes Refinery — but unit production lags badly, and no offensive units reach the field before time runs out.
One row per episode, reconstructed from the deterministic .orarep replay files. Note the x-axes are not shared — Games 1–2 span ~1500 ticks (30 min), Games 3–5 span ~600 ticks (5 min).
The tool-call distribution explains why an LLM can play a 25 Hz game at all:
Left: by category. Right: top 10 individual tools. The dominance of advance (~57% of all calls) is the agent compressing idle game time by skipping ticks — exactly what the async architecture is designed for.
And the macro indicators show the agent's economic management is consistent across the two horizon regimes once you normalize for time:
Left: end-of-game asset value (drops in shorter games, as expected). Centre: per-tick-normalized economy reward (0.58–0.80 across all five games). Right: percent of map explored. The economy quality is steady whether the game runs 5 minutes or 30 — the only thing that changes is how much absolute output the agent gets.
To give you a sense of what the agent's decision loop actually looks like, here are the first 10 turns of Game 5 verbatim from the run log:
```
Turn  1 | get_faction_briefing()
Turn  2 | end_planning_phase(strategy="Deploy MCV, build Power
          Plant (300 ticks), then Barracks/War Factory (500/750),
          train E1 + 1tnk, attack AI base at (64,64).")
        -> tick=2, cash=$5000, units=1, buildings=0
Turn  3 | deploy_unit(unit_id=120)
        -> tick=4, cash=$5000, units=1, buildings=0
Turn  4 | build_and_place(building_type="apwr")
Turn  5 | advance(ticks=100)
        -> tick=55, cash=$5000, units=0, buildings=1
Turn  6 | build_and_place(building_type="apwr")
Turn  7 | build_and_place(building_type="powr")
        -> tick=57, cash=$5000, units=0, buildings=1
Turn  8 | advance(ticks=60)
        -> tick=108, cash=$4923, units=0, buildings=1
Turn  9 | advance(ticks=130)
        -> tick=159, cash=$4838, units=0, buildings=1
Turn 10 | advance(ticks=80)
        -> tick=210, cash=$4753, units=0, buildings=1
```
You can see the three-phase rhythm clearly: intel + planning, build the economy, then advance to bridge the gap between LLM latency and game speed.
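Schematically, the loop looks something like this. It is a hedged sketch: observe and call are hypothetical helpers, while the tool names match the log above.

```python
def run_episode(agent, tools):
    # Phase 1: intel + planning
    briefing = tools.get_faction_briefing()
    tools.end_planning_phase(strategy=agent.plan(briefing))
    # Phases 2-3: act on fresh state, then advance past idle game time
    obs = tools.observe()
    while not obs.game_over:
        obs = tools.call(agent.decide(obs))   # e.g. build_and_place, deploy_unit
        obs = tools.advance(ticks=100)        # ~57% of all calls in practice
```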
Try it yourself
The full minimal example — instantiate, reset, step, close — using the standard OpenEnv contract:
```python
from openra_env.config import load_config
from openra_env.server.openra_environment import OpenRAEnvironment
from openra_env.models import ActionType, CommandModel, OpenRAAction

# 1. Configure and instantiate the environment.
config = load_config(game={
    "grpc_port": 8000,
    "map_name": "tank-duel-basic",
    "headless": True,
})
env = OpenRAEnvironment(config=config)

# 2. Reset; obs is a structured observation
#    (economy, military, unit/building lists, 9-channel spatial map).
obs = env.reset(seed=0)

# 3. Issue a structured action — one or more CommandModel entries
#    drawn from 21 ActionType values (MOVE, ATTACK, BUILD, TRAIN, DEPLOY, ...).
action = OpenRAAction(commands=[
    CommandModel(action=ActionType.BUILD, item_type="powr"),
])
obs = env.step(action)

# 4. Close — finalizes the .orarep replay file.
env.close()
```
The same env.step is what gets called whether you're running a scripted bot, an MCP-tool-using LLM agent, or a TRL-driven GRPO training loop. The bridge translates MCP tool calls into the same OpenRAAction shape before forwarding. Both paths share observation and reward — which is what makes the same env usable by an in-the-loop LLM and a weight-updating RL trainer without rewiring.
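As an illustration, here is a hypothetical version of that translation for two of the demo's tools. The mapping logic is ours, not the platform's; only the tool names and the OpenRAAction / CommandModel / ActionType types come from the platform, and the unit_id parameter is an assumption.

```python
from openra_env.models import ActionType, CommandModel, OpenRAAction

def mcp_tool_to_action(tool_name: str, args: dict) -> OpenRAAction:
    """Translate an MCP tool call into the structured action the env consumes."""
    if tool_name == "build_and_place":
        return OpenRAAction(commands=[
            CommandModel(action=ActionType.BUILD, item_type=args["building_type"]),
        ])
    if tool_name == "deploy_unit":
        # unit_id is an assumed CommandModel parameter, shown for illustration
        return OpenRAAction(commands=[
            CommandModel(action=ActionType.DEPLOY, unit_id=args["unit_id"]),
        ])
    raise ValueError(f"unmapped tool: {tool_name}")
```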
If you want to run an LLM agent end-to-end, it's a one-liner:
```bash
pip install openra-rl
openra-rl play   # interactive wizard for OpenRouter / Ollama / LM Studio
```
For the full install / Docker / training paths, see the GitHub README.
What this baseline actually demonstrates
We do not claim a winning agent. We claim a research testbed, and the five-episode run validates five things about it:
The environment is strategically deep. A frontier LLM playing the simplest opponent went 0–0–5 with zero engagements. That's not a platform failure — it's evidence that even tutorial-tier Red Alert requires real strategic reasoning (build order, army composition, attack timing) that prompt-driven LLMs do not yet capture. The gap is the headroom an RL testbed needs.
The multi-dimensional reward localizes weakness. A win/loss scalar collapses all five games into "draw." The 8-D vector says combat = 0, disruption = 0, economy = 0.58–0.80, infrastructure = high — a concrete target for reward shaping and curriculum work.
The async architecture is load-bearing. 57% of tool calls are advance. Without DropOldest observations and a non-blocking action channel, a 2-second LLM cannot meaningfully play a 25 Hz game. The async design is what makes the LLM a first-class citizen, not a workaround.

In-context reflection helps, but is not enough. Episode 2's reflection diagnoses a build-order mistake ("War Factory before Power Plant"); by Episode 4 the pre-game plan opens with a Power Plant. Prompt-injection learning fixes build order — it does not close the combat gap. That's exactly the kind of environment where the jump from in-context adaptation to weight-updating RL should measurably matter.
It plugs straight into the OpenEnv ecosystem. Every behaviour above (planning, acting, rewarding, reflecting, replaying) is exposed through the standard OpenEnv interface. Pointing TRL / torchforge / Unsloth at the env's Hub identifier requires no environment-side changes.
What's next
We're releasing OpenRA-RL as open-source software and inviting the community to push on it. The most concretely interesting next steps from where we stand today:
- GRPO from the Qwen3 baseline — same agent, weight updates instead of prompt injection. Does the combat-zero result actually move?
- Curriculum from the 8-D reward — start agents in scenarios that only require the combat dimension and ladder up.
- Cross-LLM comparison — Claude Sonnet, GPT-class models, smaller locals on the same map / opponent / time limit.
- Agent-vs-agent leaderboard play — submissions go to OpenRA-Bench with replay attached; anyone can re-verify any result.
If you build on top of it, we'd love to hear from you on the GitHub issues.
References
- C. Berner et al. Dota 2 with Large Scale Deep Reinforcement Learning. arXiv:1912.06680, 2019.
- L. Han et al. TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game. arXiv:2011.13729, 2021.
- D. Lee et al. Modular Architecture for StarCraft II with Deep Reinforcement Learning. AIIDE 2018.
- Meta PyTorch Team. OpenEnv: A Standard for RL Environment Creation and Interoperability. github.com/meta-pytorch/OpenEnv, 2025.
- M. Samvelyan et al. The StarCraft Multi-Agent Challenge. arXiv:1902.04043, 2019.
- Y. Tian et al. ELF: An Extensive, Lightweight and Flexible Research Platform for Real-time Strategy Games. arXiv:1707.01067, 2017.
- O. Vinyals et al. StarCraft II: A New Challenge for Reinforcement Learning. arXiv:1708.04782, 2017.
- L. von Werra et al. TRL: Transformer Reinforcement Learning. github.com/huggingface/trl, 2020.
- X. Wang et al. SCC: An Efficient Deep RL Agent Mastering StarCraft II. PMLR v139, 2021.
- S. Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.