🌤️ **Update 2026/05/09:** Core functionality is good; now dogfooding, testing, and refining, focused on real, long-running sessions. Uploaded a fresh GGUF. See the punch list below for progress. Bug reports with agent diagnoses are welcome.
# 🧪 Experimental llama.cpp fork and GGUFs for DeepSeek-V4-Flash
A stopgap for experimenting with DeepSeek-V4-Flash locally on CUDA and ROCm while the tooling ecosystem catches up. Expect rough edges. The priority is validating text and coding coherence.
GGUF files for deepseek-ai/DeepSeek-V4-Flash.
## ⚠️ You need the custom fork
These GGUFs require a DeepSeek-V4-capable fork of llama.cpp; vanilla llama.cpp doesn't support this architecture yet. A build sketch follows the list below.
- llama.cpp fork: ssweens/llama.cpp-deepseek-v4
- Backends: tested on CPU, CUDA, ROCm, and Vulkan.
- Compatibility: also compatible with antirez's GGUFs (antirez/deepseek-v4-gguf).
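If the fork tracks upstream llama.cpp's CMake build (an assumption; defer to the fork's own README), a typical build looks like the sketch below. The repository URL is inferred from the fork name above.

```bash
# Sketch, assuming the fork keeps upstream llama.cpp's CMake flow.
git clone https://github.com/ssweens/llama.cpp-deepseek-v4
cd llama.cpp-deepseek-v4

# CUDA build; use -DGGML_HIP=ON for ROCm or -DGGML_VULKAN=ON for Vulkan instead.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```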
Example:

```bash
llama-server -ngl 99 --no-mmap -fa on -np 1 --reasoning-format auto --jinja --threads 3 \
  -ts 4,4,3 -dev CUDA0,CUDA1,CUDA2 \
  -m /mnt/supmodels/gguf/deepseek-ai__DeepSeek-V4-Flash/deepseek-ai__DeepSeek-V4-Flash-Q4_K_M.gguf \
  -c 65536 -b 2048 -ub 512 -ctk q8_0 -ctv q8_0
```
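Once the server is up (llama-server listens on port 8080 by default), you can sanity-check it through the OpenAI-compatible endpoint; the prompt here is just an illustration:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Write a Python function that reverses a string."}
        ],
        "temperature": 0.6
      }'
```

In the server command above, `-ts 4,4,3` splits the model across three devices in a 4:4:3 ratio and `-dev CUDA0,CUDA1,CUDA2` selects which GPUs participate; adjust both to your hardware.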
## Performance
### Basic coding coherence (humaneval_instruct, n=30)
| Model | pass@1 |
|---|---|
| IQ1_M | 1.000±0.000 |
| IQ2_XXS | 1.000±0.000 |
| IQ2_XXS (Antirez) | 1.000±0.000 |
| BF16ish | 1.000±0.000 |
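On the zero error bars: assuming the harness uses the standard unbiased HumanEval pass@k estimator, it reduces to a plain pass fraction at k = 1, so 30/30 passing samples gives exactly 1.000 ± 0.000:

$$\text{pass@}k = \mathbb{E}_{\text{tasks}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right] \;\overset{k=1}{=}\; \mathbb{E}_{\text{tasks}}\left[\frac{c}{n}\right]$$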
### Speed (llama-benchy, defaults)
Note: quants up to IQ2_XS run CUDA-only, pipeline-parallel across a mix of consumer RTX cards. IQ2_antirez and larger run CUDA+ROCm, pipeline-parallel with a Strix Halo mixed in.
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS | pp2048 | 358.44 ± 2.05 | | 5714.56 ± 32.61 | 5713.91 ± 32.61 | 5714.56 ± 32.61 |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS | tg32 | 30.62 ± 0.28 | 31.00 ± 0.00 | | | |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS | pp32768 | 249.63 ± 0.79 | | 131244.94 ± 413.90 | 131244.26 ± 413.90 | 131244.94 ± 413.90 |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS | tg32 | 24.54 ± 0.23 | 25.00 ± 0.00 | | | |
#### ROCm only
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS-rocm | pp2048 | 66.96 ± 1.38 | | 30640.40 ± 637.11 | 30597.24 ± 637.11 | 30640.40 ± 637.11 |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS-rocm | tg32 | 8.72 ± 0.09 | 9.00 ± 0.00 | | | |
## Punch List
| Status | Feature |
|---|---|
| X | Simple chat |
| X | Basic quants |
| X | iMatrix quants |
| X | Chat template |
| X | Tool calling |
| X | Decent context |
| X | Pipeline parallelism |
| ? | Tensor parallelism |
| X | Prompt caching (contiguous prefix reuse only; no non-contiguous reuse via K-shift; see the sketch below the table) |
| X | CPU |
| X | CUDA |
| X | ROCm |
| ? | Vulkan |
| X | Cross-platform GPU |
| X | antirez/ds4 compat |
| | Prefill optimization |
| ?? | DSv4 Pro compat |
| | MTP support |
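On the prompt-caching row: since only a contiguous prefix is reused, cache hits depend on keeping the system prompt and earlier turns byte-identical across requests. A minimal sketch against llama-server's native completion endpoint; `cache_prompt` is an upstream llama-server request field, and this fork is assumed to honor it the same way:

```bash
# First request fills the KV cache for the shared prefix.
curl http://localhost:8080/completion -H "Content-Type: application/json" -d '{
  "prompt": "You are a code reviewer.\n\nReview this diff: ...",
  "n_predict": 256,
  "cache_prompt": true
}'
# A follow-up request that repeats the same prefix verbatim reuses the
# cached KV entries and only processes the newly appended suffix tokens.
```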
## Original model

deepseek-ai/DeepSeek-V4-Flash
## Thanks
- antirez — llama.cpp fork for Metal and CUDA in llama.cpp-deepseek-v4-flash and DS4
- ml-explore/mlx-lm #1192 — MLX DSV4
- DeepSeek — open inference code and the technical report
- nisparks et al. — some early implementation efforts and discussion
- llama.cpp — the project that makes local LLM inference possible