# Open DeepSeek-V4: Community Reproduction
An open-source, HuggingFace-compatible reproduction of DeepSeek-V4, a 1.6T-parameter Mixture-of-Experts language model with 49B activated parameters and a 1M-token context length.
Based on the DeepSeek-V4 Technical Report and official inference code.
## Architecture Overview
DeepSeek-V4 introduces several key innovations over DeepSeek-V3:
### 1. Multi-head Latent Attention (MLA) with Compressed Sparse Attention
- Low-rank KV compression: Joint KV projection to `head_dim=512` (vs V3's separate `kv_lora_rank=512` + `qk_nope_head_dim=128` + `v_head_dim=128`); see the sketch after this list
- Low-rank Q compression: `q_lora_rank=1536` with an RMSNorm bottleneck
- Grouped low-rank O projection: `o_groups=16`, `o_lora_rank=1024`; the output is split into groups, each compressed independently
- Compressed Sparse Attention (CSA): Sliding window (128 tokens) + learned KV compression with gated pooling (4:1 ratio with overlapping windows)
- Heavily Compressed Attention (HCA): 128:1 compression ratio for global context
- Indexer: Learned top-k selection of compressed KV positions for sparse attention
- Attention sink: Learnable per-head bias to capture global information
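
A minimal sketch of the joint low-rank KV and Q compression described above. The class and parameter names are illustrative placeholders (the real layers live in `modeling_deepseek_v4.py` and may differ); the dimensions follow the numbers quoted in this list.

```python
import torch
from torch import nn

class LowRankQKVSketch(nn.Module):
    """Illustrative MLA-style compression: a joint 512-dim KV latent plus a
    low-rank Q path with an RMSNorm bottleneck. Not the repo's actual class."""

    def __init__(self, hidden_size=7168, kv_dim=512, q_lora_rank=1536,
                 num_heads=128, head_dim=128):
        super().__init__()
        # Joint KV projection: one shared latent instead of V3's separate
        # nope / rope / v projections.
        self.kv_down = nn.Linear(hidden_size, kv_dim, bias=False)
        # Low-rank Q: down-project, normalize, then expand to all heads.
        self.q_down = nn.Linear(hidden_size, q_lora_rank, bias=False)
        self.q_norm = nn.RMSNorm(q_lora_rank)  # requires PyTorch >= 2.4
        self.q_up = nn.Linear(q_lora_rank, num_heads * head_dim, bias=False)

    def forward(self, x):
        # x: [batch, seq, hidden_size]
        kv_latent = self.kv_down(x)                  # [batch, seq, 512] shared latent
        q = self.q_up(self.q_norm(self.q_down(x)))   # [batch, seq, heads * head_dim]
        return q, kv_latent
```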
### 2. DeepSeekMoE with Hash Routing
- 384 routed experts, 1 shared expert, top-6 routing
- Hash routing for first 3 layers (deterministic expert assignment by token ID)
- Score-based routing with `sqrtsoftplus` activation (√softplus) for the remaining layers; see the router sketch after this list
- Auxiliary-loss-free load balancing via bias correction (`noaux_tc`)
- SwiGLU with clamping: `swiglu_limit=10.0` for numerical stability
- FP4 expert weights with E8M0 per-32 block scales
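
A hedged sketch of the routing logic described above: deterministic hash routing for the first layers and √softplus scoring with top-6 selection elsewhere. The function name, the modulo hash, and the weight normalization are assumptions for illustration; the actual gate is in `modeling_deepseek_v4.py`.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, token_ids, gate_weight, layer_idx,
                 n_experts=384, top_k=6, n_hash_layers=3):
    """Illustrative router: hash routing in early layers, score-based elsewhere.

    hidden:      [n_tokens, hidden_dim] token activations
    token_ids:   [n_tokens] input ids (only used by hash routing)
    gate_weight: [n_experts, hidden_dim] gate projection matrix
    Names and the modulo hash are assumptions, not the repo's exact code.
    """
    if layer_idx < n_hash_layers:
        # Deterministic expert assignment by token id (one expert per token).
        expert_idx = (token_ids % n_experts).unsqueeze(-1)        # [n_tokens, 1]
        weights = torch.ones(expert_idx.shape, dtype=hidden.dtype,
                             device=hidden.device)
        return expert_idx, weights

    # Score-based routing: sqrt(softplus(logit)) keeps scores non-negative
    # while damping very large logits.
    logits = hidden @ gate_weight.t()                             # [n_tokens, n_experts]
    scores = torch.sqrt(F.softplus(logits))
    weights, expert_idx = torch.topk(scores, top_k, dim=-1)       # top-6 experts
    weights = weights / weights.sum(dim=-1, keepdim=True)         # normalize mixture weights
    return expert_idx, weights
```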
### 3. Manifold-Constrained Hyper-Connections (mHC)
- Replaces standard residual connections
- Maintains `hc_mult=4` copies of the hidden state
- Pre-connection: Sinkhorn-normalized mixing (20 iterations) reduces 4 copies → 1 (see the sketch after this list)
- Post-connection: Expands 1 → 4 copies via learned post-weights + a combination matrix
- Improves signal propagation stability across 61 layers
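
A minimal sketch of the Sinkhorn-normalized pre-connection mixing (the 4 → 1 reduction). The 20-iteration count comes from the list above; the exact mixing and reduction scheme shown here is an illustrative guess, not the repo's implementation.

```python
import torch

def sinkhorn_normalize(logits, n_iters=20):
    """Alternately normalize rows and columns so the mixing matrix is
    approximately doubly stochastic (illustrative, not the repo's exact code)."""
    m = torch.exp(logits)
    for _ in range(n_iters):
        m = m / m.sum(dim=-1, keepdim=True)   # row normalization
        m = m / m.sum(dim=-2, keepdim=True)   # column normalization
    return m

def pre_connection(hidden_copies, mix_logits):
    """Reduce hc_mult copies of the hidden state to one mixed stream.

    hidden_copies: [batch, seq, hc_mult, hidden]
    mix_logits:    [hc_mult, hc_mult] learned mixing parameters
    """
    mix = sinkhorn_normalize(mix_logits)                     # [hc_mult, hc_mult]
    mixed = torch.einsum("bsch,kc->bskh", hidden_copies, mix)
    return mixed.mean(dim=2)                                 # collapse 4 copies -> 1
```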
### 4. Multi-Token Prediction (MTP)
- `n_mtp_layers=1`: One additional prediction head (see the loss sketch after this list)
- The MTP block has its own transformer block + embedding projection
- Shares embedding and head weights with main model
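
A small sketch of how the single MTP head's loss could be combined with the main next-token loss. The offset-by-two target and the `mtp_weight` value are assumptions for illustration; the actual training objective follows the paper.

```python
import torch.nn.functional as F

def combined_mtp_loss(main_logits, mtp_logits, labels, mtp_weight=0.3):
    """Illustrative combined loss: the main head predicts token t+1, the single
    MTP head predicts token t+2. The weighting is an assumption, not from the paper."""
    # Main next-token loss: logits at position t vs label at t+1.
    main = F.cross_entropy(main_logits[:, :-1].flatten(0, 1),
                           labels[:, 1:].flatten())
    # MTP loss: logits at position t vs label at t+2.
    mtp = F.cross_entropy(mtp_logits[:, :-2].flatten(0, 1),
                          labels[:, 2:].flatten())
    return main + mtp_weight * mtp
```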
## Model Configurations
| Config | Total Params | Active Params | Layers | Experts | Hidden | Heads |
|---|---|---|---|---|---|---|
| V4-Pro | 1.6T | 49B | 61 | 384 | 7168 | 128 |
| V4-Flash | 284B | 13B | 61 | 128* | 4096* | 64* |
| V4-Nano (ours) | ~523M | ~200M | 7 | 8 | 1024 | 16 |
*V4-Flash specs are estimated from the paper.
## Key Differences from DeepSeek-V3
| Feature | DeepSeek-V3 | DeepSeek-V4 |
|---|---|---|
| Attention | MLA with full KV | MLA + Compressed Sparse Attention |
| KV format | `kv_lora_rank=512`, separate nope/rope/v | Unified `head_dim=512` |
| Output projection | Single `wo` | Grouped low-rank `wo_a` + `wo_b` |
| Residual connections | Standard | Manifold-Constrained Hyper-Connections |
| Expert scoring | `sigmoid` | `sqrtsoftplus` (√softplus) |
| Expert routing | Score-based all layers | Hash routing (first 3) + score-based |
| Expert precision | FP8 | FP4 (with E8M0 scales) |
| Context length | 128K | 1M tokens |
| Sliding window | None | 128 tokens |
| KV compression | None | Learned gated pooling (4:1 and 128:1) |
| Sparse attention | None | Indexer-based top-k selection |
| Optimizer | AdamW | Muon |
| Training tokens | 14.8T | 32T+ |
## Files
```
open-deepseek-v4/
├── configuration_deepseek_v4.py   # HF-compatible config class
├── modeling_deepseek_v4.py        # Full model implementation (pure PyTorch, no custom kernels)
├── architecture_analysis.py       # Detailed V3→V4 architectural comparison
├── test_model.py                  # Comprehensive test suite (12 tests)
├── train.py                       # Training script with HF Trainer
├── configs/
│   ├── config_pro.json            # 1.6T Pro config (matches official)
│   └── config_nano.json           # ~523M Nano config for testing
└── README.md                      # This file
```
## Usage
```python
import json

import torch

from configuration_deepseek_v4 import DeepSeekV4Config
from modeling_deepseek_v4 import DeepSeekV4ForCausalLM

# Load nano config for testing
with open("configs/config_nano.json") as f:
    config = DeepSeekV4Config(**json.load(f))
model = DeepSeekV4ForCausalLM(config)

# Forward pass
input_ids = torch.randint(0, config.vocab_size, (1, 128))
outputs = model(input_ids)
logits = outputs.logits  # [1, 128, vocab_size]

# With labels (training)
labels = torch.randint(0, config.vocab_size, (1, 128))
outputs = model(input_ids, labels=labels)
loss = outputs.loss
loss.backward()
```
## Test Results (Nano config)
All 12 architecture tests pass:
- ✅ Configuration loading
- ✅ RMSNorm
- ✅ RoPE with YaRN scaling
- ✅ KV Compressor (4:1 and 128:1)
- ✅ Hyper-Connections (pre + post + head)
- ✅ MoE Gate (hash + score routing with sqrtsoftplus)
- ✅ Expert (SwiGLU with clamping)
- ✅ Full MoE layer
- ✅ Attention (MLA + Compressed Sparse)
- ✅ Transformer Block (with HC)
- ✅ Full model forward + backward
- ✅ Autoregressive generation
## Training Recipe (from paper)
- Optimizer: Muon (momentum-based, matrix-valued updates); a simplified sketch follows this list
- Pre-training: 32T+ diverse tokens
- Sequence length: Progressive; starts shorter, extends to 1M
- Precision: FP8 for linear layers (block-wise 128×128), FP4 for experts
- Post-training: Two-stage
  1. Independent domain expert cultivation (SFT + RL with GRPO)
  2. Unified model consolidation via on-policy distillation
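
For reference, a heavily simplified sketch of a Muon-style update: accumulate momentum, then orthogonalize the matrix-valued update via a Newton-Schulz iteration. This follows the public Muon reference implementation in spirit; it is not the paper's exact training setup.

```python
import torch

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D update matrix via Newton-Schulz iteration
    (coefficients follow the public Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + eps)
    if g.size(0) > g.size(1):
        x = x.t()                          # keep the short side as rows
    for _ in range(steps):
        s = x @ x.t()
        x = a * x + (b * s + c * s @ s) @ x
    if g.size(0) > g.size(1):
        x = x.t()
    return x

def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    """One illustrative Muon-style step for a matrix-valued parameter."""
    momentum_buf.mul_(beta).add_(grad)                 # momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)
    param.data.add_(update, alpha=-lr)                 # orthogonalized update
    return momentum_buf
```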
## Notes
- This is a pure PyTorch implementation; no custom CUDA kernels required
- The official inference uses `tilelang` kernels for FP4/FP8 GEMM and sparse attention
- For production-scale training, you'll need the custom kernels from the official repo
- The Nano config is designed for architecture validation, not for meaningful language modeling
## License
MIT (matching DeepSeek-V4's license)
## Citation
```bibtex
@misc{deepseekai2026deepseekv4,
  title={DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
  author={DeepSeek-AI},
  year={2026},
}
```