Open DeepSeek-V4: Community Reproduction

An open-source, HuggingFace-compatible reproduction of DeepSeek-V4: a 1.6T-parameter Mixture-of-Experts language model with 49B activated parameters and a 1M-token context length.

Based on the DeepSeek-V4 Technical Report and official inference code.

πŸ—οΈ Architecture Overview

DeepSeek-V4 introduces several key innovations over DeepSeek-V3:

1. Multi-head Latent Attention (MLA) with Compressed Sparse Attention

  • Low-rank KV compression: Joint KV projection to head_dim=512 (vs V3's separate kv_lora_rank=512 + qk_nope_head_dim=128 + v_head_dim=128)
  • Low-rank Q compression: q_lora_rank=1536 with RMSNorm bottleneck
  • Grouped low-rank O projection: o_groups=16, o_lora_rank=1024; the output is split into groups, each compressed independently
  • Compressed Sparse Attention (CSA): Sliding window (128 tokens) + learned KV compression with gated pooling (4:1 ratio with overlapping windows)
  • Heavily Compressed Attention (HCA): 128:1 compression ratio for global context
  • Indexer: Learned top-k selection of compressed KV positions for sparse attention
  • Attention sink: Learnable per-head bias to capture global information
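The low-rank KV path above can be sketched in plain PyTorch. This is a minimal illustration, not the official implementation: module names and the small dimensions are invented, and details such as the RoPE split are omitted. The key point is that only the shared latent needs to be cached.

```python
import torch
import torch.nn as nn

class LowRankKV(nn.Module):
    """Joint low-rank KV compression: hidden -> shared latent -> per-head K/V.
    Only the latent is cached, which is what makes MLA's KV cache cheap."""
    def __init__(self, hidden=1024, kv_latent=512, n_heads=16, head_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden, kv_latent, bias=False)              # compress
        self.up_k = nn.Linear(kv_latent, n_heads * head_dim, bias=False)  # expand to K
        self.up_v = nn.Linear(kv_latent, n_heads * head_dim, bias=False)  # expand to V
        self.n_heads, self.head_dim = n_heads, head_dim

    def forward(self, x):                      # x: [batch, seq, hidden]
        latent = self.down(x)                  # [batch, seq, kv_latent] -- the KV cache
        b, s, _ = x.shape
        k = self.up_k(latent).view(b, s, self.n_heads, self.head_dim)
        v = self.up_v(latent).view(b, s, self.n_heads, self.head_dim)
        return k, v

k, v = LowRankKV()(torch.randn(1, 8, 1024))
print(k.shape)  # torch.Size([1, 8, 16, 64])
```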

2. DeepSeekMoE with Hash Routing

  • 384 routed experts, 1 shared expert, top-6 routing
  • Hash routing for first 3 layers (deterministic expert assignment by token ID)
  • Score-based routing with sqrtsoftplus activation (√softplus) for remaining layers
  • Auxiliary-loss-free load balancing via bias correction (noaux_tc)
  • SwiGLU with clamping: swiglu_limit=10.0 for numerical stability
  • FP4 expert weights with E8M0 per-32 block scales
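The scoring and routing above can be sketched as follows. Only the √softplus score function, top-6 selection, and the hash/score split come from the list; the bias-correction and weight-normalization details are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def sqrtsoftplus(x):
    # sqrt(softplus(x)): strictly positive scores, slower-than-linear growth
    return torch.sqrt(F.softplus(x))

def route(scores, top_k=6, bias=None):
    """Top-k expert selection. `bias` stands in for the auxiliary-loss-free
    load-balancing correction: it shifts which experts are selected,
    but the combination weights come from the raw scores."""
    sel = scores if bias is None else scores + bias
    _, idx = sel.topk(top_k, dim=-1)
    w = scores.gather(-1, idx)
    return idx, w / w.sum(-1, keepdim=True)

# score-based routing over 384 experts for a batch of 2 tokens
logits = torch.randn(2, 384)
idx, w = route(sqrtsoftplus(logits))
print(idx.shape, w.shape)  # torch.Size([2, 6]) torch.Size([2, 6])

# hash routing (first 3 layers): deterministic assignment by token id
token_ids = torch.tensor([17, 901])
hash_idx = token_ids % 384
```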

3. Manifold-Constrained Hyper-Connections (mHC)

  • Replaces standard residual connections
  • Maintains hc_mult=4 copies of hidden state
  • Pre-connection: Sinkhorn-normalized mixing (20 iterations) reduces 4 copies → 1
  • Post-connection: Expands 1 → 4 copies via learned post-weights + combination matrix
  • Improves signal propagation stability across 61 layers
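A sketch of the pre-connection step: the Sinkhorn loop follows the standard alternating row/column normalization, while how the four copies are reduced to one (here, by selecting a single mixed copy) is a simplifying guess.

```python
import torch

def sinkhorn(logits, n_iters=20):
    """Alternate row/column normalization, projecting exp(logits)
    toward the set of doubly stochastic matrices."""
    m = logits.exp()
    for _ in range(n_iters):
        m = m / m.sum(dim=-1, keepdim=True)   # rows sum to 1
        m = m / m.sum(dim=-2, keepdim=True)   # columns sum to 1
    return m

hc_mult = 4
mix = sinkhorn(torch.randn(hc_mult, hc_mult))     # learned logits in the real model
copies = torch.randn(hc_mult, 8, 1024)            # [hc_mult, seq, hidden]
mixed = torch.einsum('ij,jsh->ish', mix, copies)  # Sinkhorn-normalized mixing
block_input = mixed[0]                            # one mixed copy feeds the block
```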

4. Multi-Token Prediction (MTP)

  • n_mtp_layers=1: One additional prediction head
  • MTP block has its own transformer block + embedding projection
  • Shares embedding and head weights with main model
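A rough sketch of how the MTP head could compute its loss, assuming the common fuse-then-predict wiring. The projection `mtp_proj` and the exact position offsets are guesses; the weight tying mirrors the "shares embedding and head weights" bullet, and a real MTP block would additionally run its own transformer layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, hidden = 1000, 64
embed = nn.Embedding(vocab, hidden)
lm_head = nn.Linear(hidden, vocab, bias=False)
lm_head.weight = embed.weight                         # tied with the main model
mtp_proj = nn.Linear(2 * hidden, hidden, bias=False)  # fuse state + next-token embedding

def mtp_loss(main_hidden, input_ids, labels):
    # predict token t+2 from the state at t and the embedding of token t+1
    h = main_hidden[:, :-2]                    # states at positions t
    e = embed(input_ids[:, 1:-1])              # embeddings of tokens t+1
    fused = mtp_proj(torch.cat([h, e], dim=-1))
    # (a real MTP block would pass `fused` through its own transformer layer here)
    logits = lm_head(fused)
    return F.cross_entropy(logits.reshape(-1, vocab), labels[:, 2:].reshape(-1))

ids = torch.randint(0, vocab, (2, 16))
loss = mtp_loss(torch.randn(2, 16, hidden), ids, ids)
```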

📊 Model Configurations

| Config | Total Params | Active Params | Layers | Experts | Hidden | Heads |
|---|---|---|---|---|---|---|
| V4-Pro | 1.6T | 49B | 61 | 384 | 7168 | 128 |
| V4-Flash | 284B | 13B | 61 | 128* | 4096* | 64* |
| V4-Nano (ours) | ~523M | ~200M | 7 | 8 | 1024 | 16 |

*V4-Flash specs are estimated from the paper.

🔧 Key Differences from DeepSeek-V3

| Feature | DeepSeek-V3 | DeepSeek-V4 |
|---|---|---|
| Attention | MLA with full KV | MLA + Compressed Sparse Attention |
| KV format | kv_lora_rank=512, separate nope/rope/v | Unified head_dim=512 |
| Output projection | Single wo | Grouped low-rank wo_a + wo_b |
| Residual connections | Standard | Manifold-Constrained Hyper-Connections |
| Expert scoring | sigmoid | sqrtsoftplus (√softplus) |
| Expert routing | Score-based, all layers | Hash routing (first 3) + score-based |
| Expert precision | FP8 | FP4 (with E8M0 scales) |
| Context length | 128K | 1M tokens |
| Sliding window | None | 128 tokens |
| KV compression | None | Learned gated pooling (4:1 and 128:1) |
| Sparse attention | None | Indexer-based top-k selection |
| Optimizer | AdamW | Muon |
| Training tokens | 14.8T | 32T+ |
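The learned gated pooling (4:1) can be sketched like this: a learned gate weights each position inside a pooling window before summing. Non-overlapping windows are used for simplicity, although the CSA bullet above mentions overlapping ones, and the exact gating form is an assumption.

```python
import torch
import torch.nn as nn

class GatedPoolCompressor(nn.Module):
    """Compress a KV sequence 4:1 with a learned per-position gate."""
    def __init__(self, dim=512, ratio=4):
        super().__init__()
        self.gate = nn.Linear(dim, 1)
        self.ratio = ratio

    def forward(self, kv):                     # kv: [batch, seq, dim], seq % ratio == 0
        b, s, d = kv.shape
        kv = kv.view(b, s // self.ratio, self.ratio, d)
        w = self.gate(kv).softmax(dim=2)       # gate weights within each window
        return (w * kv).sum(dim=2)             # weighted pool -> [batch, seq/4, dim]

out = GatedPoolCompressor()(torch.randn(1, 32, 512))
print(out.shape)  # torch.Size([1, 8, 512])
```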

πŸ“ Files

```
open-deepseek-v4/
├── configuration_deepseek_v4.py   # HF-compatible config class
├── modeling_deepseek_v4.py        # Full model implementation (pure PyTorch, no custom kernels)
├── architecture_analysis.py       # Detailed V3→V4 architectural comparison
├── test_model.py                  # Comprehensive test suite (12 tests)
├── train.py                       # Training script with HF Trainer
├── configs/
│   ├── config_pro.json            # 1.6T Pro config (matches official)
│   └── config_nano.json           # ~523M Nano config for testing
└── README.md                      # This file
```

🚀 Usage

```python
import json
import torch

from configuration_deepseek_v4 import DeepSeekV4Config
from modeling_deepseek_v4 import DeepSeekV4ForCausalLM

# Load the nano config for testing
with open("configs/config_nano.json") as f:
    config = DeepSeekV4Config(**json.load(f))

model = DeepSeekV4ForCausalLM(config)

# Forward pass
input_ids = torch.randint(0, config.vocab_size, (1, 128))
outputs = model(input_ids)
logits = outputs.logits  # [1, 128, vocab_size]

# With labels (training)
labels = torch.randint(0, config.vocab_size, (1, 128))
outputs = model(input_ids, labels=labels)
loss = outputs.loss
loss.backward()
```
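On top of the same forward API, a naive greedy decoding loop looks like this. There is no KV cache, so the full prefix is re-run each step; `TinyLM` is a stand-in with the same `.logits` interface so the snippet runs without the full model.

```python
import torch
import torch.nn as nn
from types import SimpleNamespace

@torch.no_grad()
def greedy_decode(model, input_ids, max_new_tokens=8):
    """Naive greedy decoding: re-runs the whole prefix each step (no KV cache)."""
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits       # [batch, seq, vocab]
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=1)
    return input_ids

# stand-in model with the same .logits interface, for illustration only
class TinyLM(nn.Module):
    def __init__(self, vocab=100, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.head = nn.Linear(hidden, vocab)
    def forward(self, ids):
        return SimpleNamespace(logits=self.head(self.embed(ids)))

out = greedy_decode(TinyLM(), torch.randint(0, 100, (1, 4)))
print(out.shape)  # torch.Size([1, 12])
```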

✅ Test Results (Nano config)

All 12 architecture tests pass:

  1. ✅ Configuration loading
  2. ✅ RMSNorm
  3. ✅ RoPE with YaRN scaling
  4. ✅ KV Compressor (4:1 and 128:1)
  5. ✅ Hyper-Connections (pre + post + head)
  6. ✅ MoE Gate (hash + score routing with sqrtsoftplus)
  7. ✅ Expert (SwiGLU with clamping)
  8. ✅ Full MoE layer
  9. ✅ Attention (MLA + Compressed Sparse)
  10. ✅ Transformer Block (with HC)
  11. ✅ Full model forward + backward
  12. ✅ Autoregressive generation

πŸ“ Training Recipe (from paper)

  • Optimizer: Muon (momentum-based, matrix-valued updates)
  • Pre-training: 32T+ diverse tokens
  • Sequence length: Progressive; starts shorter, extends to 1M tokens
  • Precision: FP8 for linear layers (block-wise 128×128), FP4 for experts
  • Post-training: Two-stage
    1. Independent domain expert cultivation (SFT + RL with GRPO)
    2. Unified model consolidation via on-policy distillation
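Muon replaces AdamW by orthogonalizing the momentum matrix with a Newton-Schulz iteration before applying it. A single-step sketch, following the public Muon reference implementation (the quintic coefficients come from that code; learning rate and momentum here are illustrative, not the paper's values):

```python
import torch

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a 2-D matrix via a quintic
    Newton-Schulz iteration (coefficients from the public Muon code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified Muon update for a 2-D weight matrix."""
    momentum_buf.mul_(beta).add_(grad)                  # momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)  # matrix-valued update
    param.add_(update, alpha=-lr)
    return param

w = torch.randn(16, 32)
buf = torch.zeros_like(w)
w = muon_step(w, torch.randn(16, 32), buf)
```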

⚠️ Notes

  • This is a pure PyTorch implementation; no custom CUDA kernels required
  • The official inference uses tilelang kernels for FP4/FP8 GEMM and sparse attention
  • For production-scale training, you'll need the custom kernels from the official repo
  • The Nano config is designed for architecture validation, not for meaningful language modeling

License

MIT (matching DeepSeek-V4's license)

Citation

```bibtex
@misc{deepseekai2026deepseekv4,
    title={DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
    author={DeepSeek-AI},
    year={2026},
}
```