PAWN: Playstyle-Agnostic World-model Network for Chess

A small causal transformer trained on random chess games. It learns legal moves, board-state representations, and game dynamics purely from sequences of random legal moves, without any form of strategic play.

I've found PAWN to be a viable testbed for finetuning and augmentation methods at small scale. Since it is entirely unopinionated, it's a blank slate ready to be adapted, augmented, and finetuned into arbitrary player models with unique playstyles.

Finetuning PAWN has proven significantly more parameter-efficient than training new models from scratch and requires minimal compute resources.

Feel free to use PAWN in your own experiments. Note that PAWN was developed as a personal project by a single developer and has not been published or audited. If you spot a bug, please help out by creating an issue or PR.

PAWN is under active development and is not yet stable.

Model Variants

Three sizes, trained for 100K steps on random games (~25.6M games each):

| Variant | d_model | Layers | Heads | Params | Top-1 | Legal Rate | Download |
|---|---|---|---|---|---|---|---|
| PAWN-Small | 256 | 8 | 4 | ~9.5M | 6.73% | 99.29% | Model on HF |
| PAWN (Base) | 512 | 8 | 8 | ~35.8M | 6.86% | 99.97% | Model on HF |
| PAWN-Large | 640 | 10 | 8 | ~68.4M | 6.94% | 99.98% | Model on HF |

All variants share the same architecture: RMSNorm, SwiGLU FFN, RoPE, factored move embeddings, and a 4278-token vocabulary covering:

  • all possible (src, dst) pairs for an 8x8 grid (the chess board),
  • promotion moves: 4 piece types (queen, bishop, rook, knight) x 44 eligible (source square, destination square) pairs for pawns reaching the 1st & 8th ranks,
  • a token for each game outcome (WHITE_CHECKMATES, BLACK_CHECKMATES, STALEMATE, DRAW_BY_RULE, PLY_LIMIT),
  • and a padding token.
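The vocabulary size follows directly from this breakdown; a quick arithmetic sanity check:

```python
# Breakdown of PAWN's 4278-token vocabulary, per the list above.
src_dst_pairs = 64 * 64  # every (source, destination) square pair on the 8x8 board
promotions = 4 * 44      # 4 promotion pieces x 44 eligible pawn (src, dst) pairs
outcomes = 5             # WHITE_CHECKMATES, BLACK_CHECKMATES, STALEMATE, DRAW_BY_RULE, PLY_LIMIT
padding = 1              # PAD token

vocab_size = src_dst_pairs + promotions + outcomes + padding
print(vocab_size)  # 4278
```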

Notably, the vocabulary includes impossible moves like a1a1 and b1a5. PAWN naturally learns to avoid these since they don't appear in its training examples.

Conceptually, each token is best thought of as a move in UCI notation -- they are effectively coordinates. They do not include any information on the type of piece, side to play, or any direct geometric or board state information other than the factored nature of the embeddings.

For example, e2e4 is the token that represents the king's pawn opening, but only when it's the first ply in the sequence (moving a rook from e2 to e4 in the late game would use the same token). The model learns to track which type of piece is on each square at any given moment entirely of its own accord.

For that matter, it isn't told what piece types exist, what movement patterns they follow, or indeed the concept of a piece. All of that understanding comes purely from observation and can be isolated via linear probes (Alain & Bengio, 2016).
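The probing recipe itself is simple: fit a linear classifier on frozen hidden states and check whether a board-state concept is linearly decodable. A minimal sketch using synthetic activations as stand-ins for hidden states extracted from a frozen checkpoint (the data here is fabricated for illustration, not PAWN's actual activations):

```python
import numpy as np

# Linear probe sketch (Alain & Bengio, 2016): fit a linear map on frozen
# hidden states and measure how well it decodes a binary concept, e.g.
# "is this square occupied". High accuracy => the concept is linearly
# represented in the hidden states.
rng = np.random.default_rng(0)
d_model, n = 512, 2000

# Synthetic setup: the label is encoded along a random direction.
direction = rng.normal(size=d_model)
hidden = rng.normal(size=(n, d_model))        # stand-in for model activations
labels = (hidden @ direction > 0).astype(float)

# Least-squares linear probe (a logistic probe behaves similarly).
w, *_ = np.linalg.lstsq(hidden, labels - 0.5, rcond=None)
preds = (hidden @ w > 0).astype(float)
accuracy = (preds == labels).mean()
print(f"probe accuracy: {accuracy:.2f}")
```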

Quickstart

# Clone and build
git clone https://github.com/thomas-schweich/PAWN.git && cd PAWN

# Build the Rust chess engine (required -- handles all game logic)
cd engine && uv run --with maturin maturin develop --release && cd ..

# Install Python dependencies
uv sync --extra cu128   # NVIDIA GPU (or --extra rocm for AMD)

Train an adapter

Weights and data load directly from HuggingFace -- no submodules or local files needed:

uv run python scripts/train_bottleneck.py \
    --checkpoint thomas-schweich/pawn-base \
    --pgn thomas-schweich/pawn-lichess-full \
    --bottleneck-dim 32 --lr 1e-4 --local-checkpoints

Pretrain from scratch

Random games are generated on-the-fly; no dataset required:

uv run python scripts/train.py --variant base --local-checkpoints

# Or train all three variants simultaneously on shared data
uv run python scripts/train_all.py --local-checkpoints

Run probes and diagnostics

uv run python scripts/eval_probes.py --log-dir logs --device cuda
uv run python -m pawn.dashboard --log-dir logs  # real-time monitoring

Datasets

These datasets are for adapter training (behavioral cloning), not for pretraining PAWN itself. PAWN is pretrained exclusively on random legal games generated on-the-fly -- it never sees human or engine games during pretraining. The datasets below provide real gameplay data for finetuning the frozen PAWN backbone into player models that mimic specific playstyles or skill levels.

| Dataset | Games | Description | Link |
|---|---|---|---|
| Lichess Full | ~289M train + 50K val + 50K test | Rated games from Q1 2025 (all Elos), holdout from Jan 2026 | pawn-lichess-full |
| Stockfish nodes=1 | 900K train + 50K val + 50K test | NNUE self-play, 1 node/move | stockfish-nodes1 |

All datasets use the PAWN token format: pre-tokenized list[int16] move sequences, ready for training without any parsing. The Lichess dataset also includes clock annotations, Stockfish eval annotations (~8% of games), player hashes, Elo ratings, and game metadata.

Datasets load directly from HuggingFace via Polars lazy scan -- predicate pushdown on columns like white_elo and date lets you efficiently filter to specific Elo bands or time periods without downloading the full dataset.

Architecture

More info: docs/ARCHITECTURE.md

PAWN is a standard decoder-only transformer trained with next-token prediction on chess move sequences. Each training example is:

[outcome] [ply_1] [ply_2] ... [ply_N] [PAD] ... [PAD]

Ply tokens use a factored embedding: each move is decomposed into source square + destination square + promotion piece, with embeddings summed. This gives the model explicit spatial structure while keeping the vocabulary compact. The context window of all variants is 256 tokens.
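The factored embedding can be sketched in a few lines; table shapes and the square-index layout below are illustrative assumptions, not the project's exact code:

```python
import numpy as np

# Factored move embedding sketch: a ply's embedding is the sum of a
# source-square row, a destination-square row, and a promotion-piece row,
# rather than one row per vocabulary entry.
rng = np.random.default_rng(0)
d_model = 512
src_table = rng.normal(size=(64, d_model))   # one row per source square
dst_table = rng.normal(size=(64, d_model))   # one row per destination square
promo_table = rng.normal(size=(5, d_model))  # none, queen, bishop, rook, knight

def embed_move(src: int, dst: int, promo: int = 0) -> np.ndarray:
    """Summed factored embedding for one ply."""
    return src_table[src] + dst_table[dst] + promo_table[promo]

# Assuming rank-major square indexing with a1 = 0: e2 -> 12, e4 -> 28.
e2e4 = embed_move(12, 28)
print(e2e4.shape)  # (512,)
```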

The model's predictions are not masked to legal moves during training; it has to determine what moves are currently legal based on the sequence of moves so far.
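Concretely, the objective is plain softmax cross-entropy over all 4278 logits at each position. A minimal numpy sketch (not the project's training code, which the repo implements in its own trainer; the token id below is hypothetical):

```python
import numpy as np

# Unmasked next-token objective: cross-entropy over the full vocabulary.
# No legality mask is applied; illegal moves are discouraged only because
# they never appear as targets.
def cross_entropy(logits: np.ndarray, target: int) -> float:
    """Negative log-likelihood of the target token under softmax(logits)."""
    logits = logits - logits.max()  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[target])

rng = np.random.default_rng(0)
logits = rng.normal(size=4278)           # one position's raw scores
loss = cross_entropy(logits, target=796)  # hypothetical id of the true ply
print(f"{loss:.3f}")
```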

No attempt is made to provide the model with information about other pieces. In other words, it only thinks in moves. There is no equivalent of the multi-plane 8x8xN board representation used by e.g. AlphaZero (Silver et al., 2018) and Lc0. Any and all state representation and geometry is learned by the model internally.

What the Model Learns

Despite training exclusively on random games, PAWN develops rich internal representations. Linear probes on the base model's hidden states decode:

| Probe | Accuracy |
|---|---|
| Side to move | 100.0% |
| En passant square | 99.7% |
| Castling rights | 96.6% |
| Game phase | 90.7% |
| Piece type at square | 89.7% |
| Is check | 94.2% |
| Material count (MAE) | 6.1 |

The base and large variants also sustain a >99.9% legal move rate, correctly identifying legal moves from the move history alone.

The theoretical accuracy ceiling for random game prediction is 6.43% (unconditional) to 7.92% (MCTS-conditioned on outcome). All three models exceed the unconditional ceiling, confirming they learn structure beyond move legality.
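Intuition for the unconditional ceiling: under uniform random play, the best possible top-1 accuracy at a position with N legal moves is 1/N, so the ceiling is E[1/N] over visited positions. By Jensen's inequality this exceeds 1/E[N], which is why the ceiling is higher than the reciprocal of the average branching factor. A toy illustration with a synthetic branching-factor distribution (not measured from real games):

```python
import numpy as np

# E[1/N] vs 1/E[N] for a synthetic distribution of legal-move counts N.
rng = np.random.default_rng(0)
n_legal = rng.integers(1, 60, size=100_000)  # synthetic branching factors

ceiling = (1.0 / n_legal).mean()  # E[1/N]: best achievable top-1 accuracy
naive = 1.0 / n_legal.mean()      # 1/E[N]: reciprocal of mean branching
print(f"E[1/N] = {ceiling:.3f}  vs  1/E[N] = {naive:.3f}")
```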

Adapter Methods

More info: docs/ADAPTERS.md

PAWN ships with six adapter implementations for fine-tuning the frozen backbone on human game data:

| Method | Params (typical) | Accuracy (1800 Elo) | Description |
|---|---|---|---|
| Bottleneck | 131K | 41.7% | Houlsby-style residual MLP adapters |
| RoSA | configurable | -- | Gradient-informed sparse + LoRA |
| Sparse | 503K-2.7M | 40.2-44.7% | Random binary mask on frozen weights |
| LoRA | ~65K | 34.1% | Low-rank attention projection adapters |
| Hybrid | ~65K | 34.1% | LoRA + FiLM combined |
| FiLM | ~17K | 30.3% | Per-channel affine modulation |

A 524K bottleneck adapter achieves 42.2% accuracy predicting moves by 1800-rated Lichess players, vs. 30.9% for a standalone model with the same architecture and parameter count -- an ~11 percentage point "free" accuracy lift from the frozen backbone.
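The bottleneck (Houlsby-style) adapter is a small down-project / nonlinearity / up-project branch added residually around the frozen backbone's hidden states. A minimal numpy sketch with illustrative dimensions (the 32-dim bottleneck matches the quickstart flag; this is not the repo's implementation):

```python
import numpy as np

# Houlsby-style bottleneck adapter: h + W_up(relu(W_down(h))).
rng = np.random.default_rng(0)
d_model, bottleneck = 512, 32

w_down = rng.normal(scale=0.02, size=(d_model, bottleneck))
w_up = np.zeros((bottleneck, d_model))  # zero-init: adapter starts as identity

def adapter(h: np.ndarray) -> np.ndarray:
    """Residual bottleneck adapter over hidden states h of shape (..., d_model)."""
    z = np.maximum(h @ w_down, 0.0)  # down-project + ReLU
    return h + z @ w_up              # up-project and add back residually

h = rng.normal(size=(4, 256, d_model))  # (batch, seq, d_model)
out = adapter(h)
print(out.shape)  # (4, 256, 512)
```

Zero-initializing the up-projection is a common trick so training starts from the frozen backbone's exact behavior.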

Repository Structure

pawn/
├── pawn/                 # Core Python package
│   ├── config.py         # Model configs (small/base/large)
│   ├── model.py          # PAWN transformer
│   ├── data.py           # Random game data pipeline
│   ├── lichess_data.py   # Lichess/Parquet data pipeline
│   ├── trainer.py        # Pretraining loop
│   ├── gpu.py            # GPU auto-detection
│   ├── adapters/         # Bottleneck, LoRA, FiLM, sparse, hybrid, RoSA
│   ├── eval_suite/       # Probes, generation tests, diagnostics
│   └── dashboard/        # Solara training dashboard
├── engine/               # Rust chess engine (PyO3 bindings via shakmaty)
├── scripts/              # Training, evaluation, and data extraction
├── deploy/               # Docker, RunPod deployment, serverless handler
├── tests/                # Unit tests
└── docs/                 # Architecture, training, adapter docs

Chess Engine

PAWN includes a bundled Rust chess engine (engine/) that handles all game simulation, move generation, legal move computation, tokenization, and PGN parsing. The engine uses shakmaty under the hood, with PyO3 bindings to Python. No Python chess libraries are used.

The engine generates training data on-the-fly via chess_engine.generate_random_games(), producing well over 100 million random games per hour. It also includes enriched PGN parsing (extracting clock annotations, Stockfish evals, and headers in a single pass) and UCI engine self-play generation.

More info

  • Architecture -- model design, embeddings, training objective
  • Training -- pretraining, adapter training, deployment
  • Adapters -- adapter methods, results, quick start
  • Accuracy Ceiling -- theoretical limits for random game prediction

Acknowledgments

PAWN builds on ideas and tools from the following projects and publications:

Citation

@software{schweich2026pawn,
  author = {Schweich, Thomas},
  title = {{PAWN}: Playstyle-Agnostic World-model Network for Chess},
  year = 2026,
  url = {https://github.com/thomas-schweich/PAWN},
  license = {Apache-2.0}
}

License

Apache 2.0. See LICENSE.
