# Open DeepSeek-V4: Community Reproduction
An open-source, HuggingFace-compatible reproduction of DeepSeek-V4, a 1.6T-parameter Mixture-of-Experts language model with 49B activated parameters and a 1M-token context length.
Based on the DeepSeek-V4 Technical Report and official inference code.
## Architecture Overview
DeepSeek-V4 introduces several key innovations over DeepSeek-V3:
### 1. Multi-head Latent Attention (MLA) with Compressed Sparse Attention
- Low-rank KV compression: Joint KV projection to `head_dim=512` (vs V3's separate `kv_lora_rank=512` + `qk_nope_head_dim=128` + `v_head_dim=128`); see the sketch after this list
- Low-rank Q compression: `q_lora_rank=1536` with an RMSNorm bottleneck
- Grouped low-rank O projection: `o_groups=16`, `o_lora_rank=1024`; the output is split into groups, each compressed independently
- Compressed Sparse Attention (CSA): Sliding window (128 tokens) + learned KV compression with gated pooling (4:1 ratio with overlapping windows)
- Heavily Compressed Attention (HCA): 128:1 compression ratio for global context
- Indexer: Learned top-k selection of compressed KV positions for sparse attention
- Attention sink: Learnable per-head bias to capture global information
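
A minimal sketch of the joint low-rank KV and Q compression described above. The class and parameter names are illustrative placeholders (the real layers live in `modeling_deepseek_v4.py` and may differ); the dimensions follow the numbers quoted in this list.

```python
import torch
from torch import nn

class LowRankQKVSketch(nn.Module):
    """Illustrative MLA-style compression: a joint 512-dim KV latent plus a
    low-rank Q path with an RMSNorm bottleneck. Not the repo's actual class."""

    def __init__(self, hidden_size=7168, kv_dim=512, q_lora_rank=1536,
                 num_heads=128, head_dim=128):
        super().__init__()
        # Joint KV projection: one shared latent instead of V3's separate
        # nope / rope / v projections.
        self.kv_down = nn.Linear(hidden_size, kv_dim, bias=False)
        # Low-rank Q: down-project, normalize, then expand to all heads.
        self.q_down = nn.Linear(hidden_size, q_lora_rank, bias=False)
        self.q_norm = nn.RMSNorm(q_lora_rank)  # requires PyTorch >= 2.4
        self.q_up = nn.Linear(q_lora_rank, num_heads * head_dim, bias=False)

    def forward(self, x):
        # x: [batch, seq, hidden_size]
        kv_latent = self.kv_down(x)                  # [batch, seq, 512] shared latent
        q = self.q_up(self.q_norm(self.q_down(x)))   # [batch, seq, heads * head_dim]
        return q, kv_latent
```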
### 2. DeepSeekMoE with Hash Routing
- 384 routed experts, 1 shared expert, top-6 routing
- Hash routing for first 3 layers (deterministic expert assignment by token ID)
- Score-based routing with `sqrtsoftplus` activation (√softplus) for the remaining layers; see the router sketch after this list
- Auxiliary-loss-free load balancing via bias correction (`noaux_tc`)
- SwiGLU with clamping: `swiglu_limit=10.0` for numerical stability
- FP4 expert weights with E8M0 per-32 block scales
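
A hedged sketch of the routing logic described above: deterministic hash routing for the first layers and √softplus scoring with top-6 selection elsewhere. The function name, the modulo hash, and the weight normalization are assumptions for illustration; the actual gate is in `modeling_deepseek_v4.py`.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, token_ids, gate_weight, layer_idx,
                 n_experts=384, top_k=6, n_hash_layers=3):
    """Illustrative router: hash routing in early layers, score-based elsewhere.

    hidden:      [n_tokens, hidden_dim] token activations
    token_ids:   [n_tokens] input ids (only used by hash routing)
    gate_weight: [n_experts, hidden_dim] gate projection matrix
    Names and the modulo hash are assumptions, not the repo's exact code.
    """
    if layer_idx < n_hash_layers:
        # Deterministic expert assignment by token id (one expert per token).
        expert_idx = (token_ids % n_experts).unsqueeze(-1)        # [n_tokens, 1]
        weights = torch.ones(expert_idx.shape, dtype=hidden.dtype,
                             device=hidden.device)
        return expert_idx, weights

    # Score-based routing: sqrt(softplus(logit)) keeps scores non-negative
    # while damping very large logits.
    logits = hidden @ gate_weight.t()                             # [n_tokens, n_experts]
    scores = torch.sqrt(F.softplus(logits))
    weights, expert_idx = torch.topk(scores, top_k, dim=-1)       # top-6 experts
    weights = weights / weights.sum(dim=-1, keepdim=True)         # normalize mixture weights
    return expert_idx, weights
```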
### 3. Manifold-Constrained Hyper-Connections (mHC)
- Replaces standard residual connections
- Maintains `hc_mult=4` copies of the hidden state
- Pre-connection: Sinkhorn-normalized mixing (20 iterations) reduces 4 copies → 1 (see the sketch after this list)
- Post-connection: Expands 1 → 4 copies via learned post-weights + a combination matrix
- Improves signal propagation stability across 61 layers
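
A minimal sketch of the Sinkhorn-normalized pre-connection mixing (the 4 → 1 reduction). The 20-iteration count comes from the list above; the exact mixing and reduction scheme shown here is an illustrative guess, not the repo's implementation.

```python
import torch

def sinkhorn_normalize(logits, n_iters=20):
    """Alternately normalize rows and columns so the mixing matrix is
    approximately doubly stochastic (illustrative, not the repo's exact code)."""
    m = torch.exp(logits)
    for _ in range(n_iters):
        m = m / m.sum(dim=-1, keepdim=True)   # row normalization
        m = m / m.sum(dim=-2, keepdim=True)   # column normalization
    return m

def pre_connection(hidden_copies, mix_logits):
    """Reduce hc_mult copies of the hidden state to one mixed stream.

    hidden_copies: [batch, seq, hc_mult, hidden]
    mix_logits:    [hc_mult, hc_mult] learned mixing parameters
    """
    mix = sinkhorn_normalize(mix_logits)                     # [hc_mult, hc_mult]
    mixed = torch.einsum("bsch,kc->bskh", hidden_copies, mix)
    return mixed.mean(dim=2)                                 # collapse 4 copies -> 1
```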
### 4. Multi-Token Prediction (MTP)
- `n_mtp_layers=1`: One additional prediction head (see the loss sketch after this list)
- The MTP block has its own transformer block + embedding projection
- Shares embedding and head weights with main model
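
A small sketch of how the single MTP head's loss could be combined with the main next-token loss. The offset-by-two target and the `mtp_weight` value are assumptions for illustration; the actual training objective follows the paper.

```python
import torch.nn.functional as F

def combined_mtp_loss(main_logits, mtp_logits, labels, mtp_weight=0.3):
    """Illustrative combined loss: the main head predicts token t+1, the single
    MTP head predicts token t+2. The weighting is an assumption, not from the paper."""
    # Main next-token loss: logits at position t vs label at t+1.
    main = F.cross_entropy(main_logits[:, :-1].flatten(0, 1),
                           labels[:, 1:].flatten())
    # MTP loss: logits at position t vs label at t+2.
    mtp = F.cross_entropy(mtp_logits[:, :-2].flatten(0, 1),
                          labels[:, 2:].flatten())
    return main + mtp_weight * mtp
```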
## Model Configurations
| Config | Total Params | Active Params | Layers | Experts | Hidden | Heads |
|---|---|---|---|---|---|---|
| V4-Pro | 1.6T | 49B | 61 | 384 | 7168 | 128 |
| V4-Flash | 284B | 13B | 61 | 128* | 4096* | 64* |
| V4-Nano (ours) | ~523M | ~200M | 7 | 8 | 1024 | 16 |
*V4-Flash specs are estimated from the paper.
## Key Differences from DeepSeek-V3
| Feature | DeepSeek-V3 | DeepSeek-V4 |
|---|---|---|
| Attention | MLA with full KV | MLA + Compressed Sparse Attention |
| KV format | `kv_lora_rank=512`, separate nope/rope/v | Unified `head_dim=512` |
| Output projection | Single `wo` | Grouped low-rank `wo_a` + `wo_b` |
| Residual connections | Standard | Manifold-Constrained Hyper-Connections |
| Expert scoring | `sigmoid` | `sqrtsoftplus` (√softplus) |
| Expert routing | Score-based all layers | Hash routing (first 3) + score-based |
| Expert precision | FP8 | FP4 (with E8M0 scales) |
| Context length | 128K | 1M tokens |
| Sliding window | None | 128 tokens |
| KV compression | None | Learned gated pooling (4:1 and 128:1) |
| Sparse attention | None | Indexer-based top-k selection |
| Optimizer | AdamW | Muon |
| Training tokens | 14.8T | 32T+ |
## Files
```
open-deepseek-v4/
├── configuration_deepseek_v4.py   # HF-compatible config class
├── modeling_deepseek_v4.py        # Full model implementation (pure PyTorch, no custom kernels)
├── architecture_analysis.py       # Detailed V3→V4 architectural comparison
├── test_model.py                  # Comprehensive test suite (12 tests)
├── train.py                       # Training script with HF Trainer
├── configs/
│   ├── config_pro.json            # 1.6T Pro config (matches official)
│   └── config_nano.json           # ~523M Nano config for testing
└── README.md                      # This file
```
## Usage
```python
import json

import torch

from configuration_deepseek_v4 import DeepSeekV4Config
from modeling_deepseek_v4 import DeepSeekV4ForCausalLM

# Load nano config for testing
with open("configs/config_nano.json") as f:
    config = DeepSeekV4Config(**json.load(f))
model = DeepSeekV4ForCausalLM(config)

# Forward pass
input_ids = torch.randint(0, config.vocab_size, (1, 128))
outputs = model(input_ids)
logits = outputs.logits  # [1, 128, vocab_size]

# With labels (training)
labels = torch.randint(0, config.vocab_size, (1, 128))
outputs = model(input_ids, labels=labels)
loss = outputs.loss
loss.backward()
```
## Test Results (Nano config)
All 12 architecture tests pass:
- ✅ Configuration loading
- ✅ RMSNorm
- ✅ RoPE with YaRN scaling
- ✅ KV Compressor (4:1 and 128:1)
- ✅ Hyper-Connections (pre + post + head)
- ✅ MoE Gate (hash + score routing with sqrtsoftplus)
- ✅ Expert (SwiGLU with clamping)
- ✅ Full MoE layer
- ✅ Attention (MLA + Compressed Sparse)
- ✅ Transformer Block (with HC)
- ✅ Full model forward + backward
- ✅ Autoregressive generation
## Training Recipe (from paper)
- Optimizer: Muon (momentum-based, matrix-valued updates); a simplified sketch follows this list
- Pre-training: 32T+ diverse tokens
- Sequence length: Progressive; starts shorter, extends to 1M
- Precision: FP8 for linear layers (block-wise 128×128), FP4 for experts
- Post-training: Two-stage
  1. Independent domain expert cultivation (SFT + RL with GRPO)
  2. Unified model consolidation via on-policy distillation
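
For reference, a heavily simplified sketch of a Muon-style update: accumulate momentum, then orthogonalize the matrix-valued update via a Newton-Schulz iteration. This follows the public Muon reference implementation in spirit; it is not the paper's exact training setup.

```python
import torch

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D update matrix via Newton-Schulz iteration
    (coefficients follow the public Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + eps)
    if g.size(0) > g.size(1):
        x = x.t()                          # keep the short side as rows
    for _ in range(steps):
        s = x @ x.t()
        x = a * x + (b * s + c * s @ s) @ x
    if g.size(0) > g.size(1):
        x = x.t()
    return x

def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    """One illustrative Muon-style step for a matrix-valued parameter."""
    momentum_buf.mul_(beta).add_(grad)                 # momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)
    param.data.add_(update, alpha=-lr)                 # orthogonalized update
    return momentum_buf
```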
## Notes
- This is a pure PyTorch implementation; no custom CUDA kernels required
- The official inference uses `tilelang` kernels for FP4/FP8 GEMM and sparse attention
- For production-scale training, you'll need the custom kernels from the official repo
- The Nano config is designed for architecture validation, not for meaningful language modeling
## License
MIT (matching DeepSeek-V4's license)
## Citation
```bibtex
@misc{deepseekai2026deepseekv4,
  title={DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
  author={DeepSeek-AI},
  year={2026},
}
```