🌤️ **Update: 2026/05/09. Core functionality is good; now dogfooding, testing, and refining.**
Focused on real, long-running sessions now. Uploaded a fresh GGUF. See the punch list below for progress. Bug reports with agent diagnoses are welcome.

🧪 Experimental llama.cpp fork and GGUFs for DeepSeek-V4-Flash

A stopgap for experimenting with DeepSeek-V4-Flash locally on CUDA and ROCm while the tooling ecosystem catches up. Expect rough edges. The priority is validating text and coding coherence.

GGUF files for deepseek-ai/DeepSeek-V4-Flash.

⚠️ You need the custom fork

These GGUFs require a DeepSeek-V4-capable fork of llama.cpp. Vanilla llama.cpp doesn't support this architecture yet.

Example:

llama-server -ngl 99 --no-mmap -fa on -np 1 --reasoning-format auto --jinja --threads 3 -ts 4,4,3 -dev CUDA0,CUDA1,CUDA2 \
-m /mnt/supmodels/gguf/deepseek-ai__DeepSeek-V4-Flash/deepseek-ai__DeepSeek-V4-Flash-Q4_K_M.gguf -c 65536 -b 2048 -ub 512 -ctk q8_0 -ctv q8_0

Performance

Basic coding coherence (humaneval_instruct, n=30)

| Model | pass@1 |
| --- | --- |
| IQ1_M | 1.000 ± 0.000 |
| IQ2_XXS | 1.000 ± 0.000 |
| IQ2_XXS (Antirez) | 1.000 ± 0.000 |
| BF16ish | 1.000 ± 0.000 |
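For reference, pass@1 above is computed the standard way for HumanEval-style evals: generate n samples per task, count the c that pass, and apply the unbiased pass@k estimator. A minimal sketch (the function name is mine, not from any particular harness):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n generations, c of them
    correct, passes."""
    if n - c < k:
        # Fewer failures than draws: a correct sample is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=30 samples per task and all 30 passing, pass@1 is exactly 1.0,
# which is why every quant in the table reports 1.000 ± 0.000.
print(pass_at_k(30, 30, 1))  # 1.0
```

With all samples correct the estimator saturates at 1.0, so the table mainly shows that none of the quants has collapsed on easy coding tasks, not that they are equivalent in quality.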

Speed (llama-benchy, defaults)

Note: models up to IQ2_XS run on CUDA only, pipeline-parallel across a mix of consumer RTX cards.
IQ2_antirez and larger run on CUDA+ROCm, pipeline-parallel with a Strix Halo mixed in.

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS | pp2048 | 358.44 ± 2.05 | | 5714.56 ± 32.61 | 5713.91 ± 32.61 | 5714.56 ± 32.61 |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS | tg32 | 30.62 ± 0.28 | 31.00 ± 0.00 | | | |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS | pp32768 | 249.63 ± 0.79 | | 131244.94 ± 413.90 | 131244.26 ± 413.90 | 131244.94 ± 413.90 |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS | tg32 | 24.54 ± 0.23 | 25.00 ± 0.00 | | | |

ROCm Only

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS-rocm | pp2048 | 66.96 ± 1.38 | 30640.40 ± 637.11 | 30597.24 ± 637.11 | 30640.40 ± 637.11 |
| deepseek-ai/DeepSeek-V4-Flash-IQ2_XXS-rocm | tg32 | 8.72 ± 0.09 | 9.00 ± 0.00 | | | |
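As a sanity check on the numbers above, time-to-first-token is dominated by prefill, so it should come out to roughly prompt length divided by pp throughput. A quick back-of-the-envelope (function name is mine):

```python
def est_ttft_ms(n_prompt: int, pp_tps: float) -> float:
    """Estimated time-to-first-token in ms: prefill tokens / prefill
    throughput (ignores the single decode step after prefill)."""
    return n_prompt / pp_tps * 1000.0

# pp2048 at ~358 t/s on the CUDA mix: ~5714 ms, matching the table.
print(round(est_ttft_ms(2048, 358.44)))
# pp2048 at ~67 t/s on ROCm only: ~30585 ms, close to the table's ~30640 ms.
print(round(est_ttft_ms(2048, 66.96)))
```

The same arithmetic explains the pp32768 row: 32768 tokens at ~250 t/s is over two minutes of prefill, which is why prefill optimization is still on the punch list.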

Punch List

| Status | Feature |
| --- | --- |
| X | Simple chat |
| X | Basic quants |
| X | iMatrix quants |
| X | Chat template |
| X | Tool calling |
| X | Decent context |
| X | Pipeline parallelism |
| ? | Tensor parallelism |
| X | Prompt caching (no non-contiguous reuse via K-shift) |
| X | CPU |
| X | CUDA |
| X | ROCm |
| ? | Vulkan |
| X | Cross-platform GPU |
| X | antirez/ds4 compat |
| | Prefill optimization |
| ?? | DSv4 Pro compat |
| | MTP support |
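The caveat on the prompt-caching item is about contiguous prefix reuse: a cached KV sequence can only be reused up to the longest common token prefix with the new prompt, since shifting cached entries past an edit point (K-shift) isn't supported here. An illustrative sketch of that rule (my own toy code, not llama.cpp's):

```python
def reusable_prefix(cached: list[int], new: list[int]) -> int:
    """Length of the contiguous token prefix shared by the cached KV
    sequence and the new prompt; only this span skips re-prefill."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

cached = [1, 2, 3, 4, 5]
# Same leading tokens, different continuation: first 3 tokens reused.
print(reusable_prefix(cached, [1, 2, 3, 9, 9]))  # 3
# An edit early in the prompt invalidates everything after it,
# even though tokens 3, 4 still appear later (no K-shift reuse).
print(reusable_prefix(cached, [1, 2, 99, 3, 4]))  # 2
```

Practical upshot: keep the stable system prompt at the front and append new turns at the end to get the most out of the cache.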

Original model: deepseek-ai/DeepSeek-V4-Flash

Thanks

Model size: 284B params. Architecture: deepseek4.