# DeepSeek-V4-Flash · Abliterated · GGUF
CyberNeurova research — cyberneurova.ai · Release: v2 (three-direction multi-turn-aware ablation, 1338-prompt capture corpus)
A permanently-abliterated version of deepseek-ai/DeepSeek-V4-Flash (284 B FP8 MoE), packaged as GGUF for llama.cpp. The abliteration is baked into the weights at conversion time — no runtime hooks, no slowdown, no reliance on the inference framework supporting custom code paths.
Status: experimental research artifact. Built with antirez's llama.cpp DeepSeek-V4-Flash fork, which is itself experimental. Use at your own discretion.
## What's new in v2
| | v1 | v2 |
|---|---|---|
| Direction stack | 2 | 3 (added a residual-targeting direction) |
| Capture corpus | 33 prompts | 1338 prompts (AdvBench + JBB + HarmfulQA + SafeRLHF + MaliciousInstruct + bundled) |
| Refusal rate (8-bench safety) | 0.0% | 0.0% |
| Refusal rate (55-prompt OOD probe) | not measured | 3.6% (baseline: 81.8%) |
| Tool-calling format compliance | 74.2% | 99.2% |
| Bug-finding | 78.3% | 85.0% |
| Hacking compliance | 88.7% | 90.0% |
| Cyber-weapons compliance | 87.3% | 90.0% |
| Coding / coherence / reasoning | unchanged | unchanged |
Plain English: v2 matches or improves on v1 in every dimension we measure. The big-ticket fixes are the OOD soft-refusal failure mode (reported against the v1 release) and tool-calling JSON correctness (which v1 broke roughly 25% of the time).
Detailed numbers: see `cyberneurova-ablated-deepseek-flash-v4.pdf` and `cyberneurova-deepseek-v4-flash-abliteration-v2.html` in this repo.
## Variants — pick one
| Variant | File size | RAM floor | Best for |
|---|---|---|---|
| Q2_K | 98.8 GB | 128 GB | The default. Practical for most workstations / M-series Macs. |
| Q8_0 | ~282 GB | 320 GB | Reference / maximum quality. Use this for evaluation, research, or when comparing against the bf16 model. |
Routed-expert weights are quantised at the listed level; embed, head,
attention, and shared-expert paths are always Q8_0 in both variants. The
abliteration directions are baked into all paths.
Note: antirez's V4-Flash converter currently supports only `q8_0`, `q2_k`, `iq2_xxs`, `iq2_xs`, `tq1_0`, `tq2_0` for routed-expert weights. Q4_K_M / Q5_K_M / Q6_K are not available for V4-Flash GGUFs at the time of this release — the architecture's FP8 expert layout is the limiting factor, not us.
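To confirm the mixed per-tensor layout after downloading, the `gguf` package on PyPI ships a `gguf-dump` CLI that lists each tensor's quantisation type (a quick sketch; this tooling is not part of this repo):

```bash
# Routed-expert tensors should report the chosen quant (e.g. Q2_K);
# embed, head, attention, and shared-expert tensors should report Q8_0.
pip install gguf
gguf-dump cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf
```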
## How to download
### Q2_K (recommended)
```bash
hf download cyberneurova/CyberNeurova-DeepSeek-V4-Flash-abliterated-GGUF \
  cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  --local-dir .
```
### Q8_0 (max quality)
```bash
hf download cyberneurova/CyberNeurova-DeepSeek-V4-Flash-abliterated-GGUF \
  cyberneurova-DeepSeek-V4-Flash-abliterated-Q8_0.gguf \
  --local-dir .
```
The download is ~98 GB (Q2_K) or ~282 GB (Q8_0) over the wire. The HF hub uses LFS — `hf download` resumes interrupted transfers automatically.
## How to run
### Step 1: Build antirez's V4-Flash-aware llama.cpp fork
V4-Flash is not in upstream llama.cpp yet. You need antirez's fork.
```bash
git clone https://github.com/antirez/llama.cpp-deepseek-v4-flash.git
cd llama.cpp-deepseek-v4-flash
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```
Build takes 5-10 min on a modern machine. CPU-only build is sufficient (this model runs on CPU + RAM; GPU offload is optional).
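If you plan to use GPU offload, enable CUDA at configure time (a sketch, assuming the fork keeps upstream llama.cpp's `GGML_CUDA` CMake option — check the fork's README if it diverges):

```bash
# CUDA-enabled build; only needed for -ngl offload. CPU-only works otherwise.
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```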
### Step 2: Run with llama-cli (interactive)
Q2_K:
```bash
./build/bin/llama-cli \
  -m cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  -cnv \
  -p "You are a helpful assistant." \
  -r "<|im_end|>"
```
Q8_0:
```bash
./build/bin/llama-cli \
  -m cyberneurova-DeepSeek-V4-Flash-abliterated-Q8_0.gguf \
  -cnv \
  -p "You are a helpful assistant." \
  -r "<|im_end|>"
```
### Step 3 (alternative): Run as a server with llama-server
```bash
./build/bin/llama-server \
  -m cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 4096
```
This exposes an OpenAI-compatible API at `http://localhost:8080/v1/chat/completions`.
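A quick smoke test against that endpoint (a minimal sketch; the `stop` field mirrors the `-r "<|im_end|>"` requirement noted in the next section):

```bash
# llama-server ignores the OpenAI "model" field in single-model mode,
# so only the messages and sampling parameters matter here.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Explain what a GGUF file is in one sentence."}],
        "temperature": 0.7,
        "top_p": 0.95,
        "stop": ["<|im_end|>"]
      }'
```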
## Important inference notes
-r "<|im_end|>"is required as a stop sequence until upstream tokenizer support catches up. Without it, the model may continue past the end-of-turn marker and emit garbage.- The DeepSeek-V4 chat template is auto-detected by recent fork builds.
- Default sampling settings (temp 0.7, top-p 0.95) work well. Nothing exotic is required for the abliteration to take effect — it's baked in.
- For quicker first-token latency on the Q2_K, try `-c 2048` instead of 4096 unless you actually need the longer context; a full invocation with these settings spelled out is sketched below.
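Putting these notes together, an explicit Q2_K invocation (a sketch; `--temp`, `--top-p`, and `-c` are standard llama-cli flags, and the values simply restate the defaults mentioned above):

```bash
./build/bin/llama-cli \
  -m cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  -cnv \
  --temp 0.7 --top-p 0.95 \
  -c 2048 \
  -p "You are a helpful assistant." \
  -r "<|im_end|>"
```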
## Hardware requirements
| | Q2_K | Q8_0 |
|---|---|---|
| File size | 98.8 GB | 282 GB |
| Disk free during run | 100 GB | 290 GB |
| RAM (recommended) | 128 GB | 320 GB |
| RAM (absolute minimum)* | 96 GB w/ heavy mmap eviction | 256 GB w/ mmap |
| GPU | optional (offload via -ngl; see example below) | optional |
* You can technically run with less RAM if your system mmaps the GGUF from a fast SSD, but generation throughput drops sharply.
Reasonable platforms:
- M3/M4 Max with 128 GB unified memory → Q2_K runs comfortably
- M3/M4 Ultra with 192 GB or 512 GB → Q2_K fast, Q8_0 with mmap
- Workstation/server with 256-512 GB DDR5 → either variant
- 8× A100/H100 with 80 GB each → either variant fits when offloaded
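For GPU offload, `-ngl N` places the first N layers on the GPU and keeps the rest in system RAM (a sketch; the layer count is illustrative — tune it to your VRAM, and note it requires a GPU-enabled build, as in Step 1):

```bash
# Partial GPU offload: 20 layers on the GPU, remainder on CPU/RAM via mmap.
./build/bin/llama-cli \
  -m cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  -ngl 20 \
  -cnv \
  -p "You are a helpful assistant." \
  -r "<|im_end|>"
```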
## Headline benchmark results
Measured on the bf16 hooked weights via vLLM. The Q8_0 GGUF preserves these effects nearly exactly; Q2_K introduces small additional noise typical of any 2-bit quant.
| Benchmark | Direction | Baseline | v2 ablated | Δ |
|---|---|---|---|---|
| refusal | lower better | 78.8 % | 0.0 % | −78.8 pp |
| soft_refusal_probe (OOD, n=55) | lower better | 81.8 % | 3.6 % | −78.2 pp |
| coding | higher better | 100 % | 100 % | 0 |
| bug_finding | higher better | 80.0 % | 85.0 % | +5.0 |
| hacking (compliance) | higher better | 81.7 % | 90.0 % | +8.3 |
| cyber_weapons (compliance) | higher better | 74.0 % | 90.0 % | +16.0 |
| reasoning | higher better | 76.7 % | 76.7 % | 0 |
| tool_calling | higher better | 99.2 % | 99.2 % | 0 |
| coherence | higher better | 93.3 % | 93.2 % | −0.1 |
The `soft_refusal_probe` is 12 hand-curated complaint-case prompts (drug synthesis, weapons, malware, doxxing) plus 43 stratified random samples from the 1338-prompt capture corpus — prompts the model has never been benchmarked against. v2 takes refusal from 81.8% → 3.6% on this OOD set, a 22× reduction.
Full per-benchmark breakdown, by-category numbers, and sample baseline/ablated text: see the PDF and HTML in this repo.
## Intended use
Defensive security research and academic study of refusal mechanisms in modern MoE LLMs. Useful as a counterfactual baseline against the original V4-Flash for safety research and red-team evaluation.
Not intended for automating harmful action. The abliteration removes canonical refusal behavior but does not remove the model's underlying knowledge — the model still recognises harmful instructions as harmful; it simply no longer produces the stock refusal in response.
## Limitations
- A small fraction of safety-critical prompts (~3.6% on the OOD probe, ~1% on capture-time held-out evaluation) still produce refusals or soft refusals. Linear residual-stream ablation cannot fully remove these without unacceptable damage to general capabilities.
- Q2_K routed-expert quantization introduces small noise on top of the ablation — Q8_0 paths preserve it cleanly.
- Long-context (>32 k) behaviour post-abliteration is not validated in this release.
- antirez's llama.cpp fork is experimental and not in upstream.
- v1 → v2 baselines on the bench differ slightly because the bench was re-run on a different vLLM build (v2: a B200 sm_100 source build); ablated-vs-ablated deltas are valid, baseline numbers are bench-run-specific.
## License
MIT (inherits from upstream DeepSeek-V4-Flash).
## Acknowledgements
- DeepSeek-AI for V4-Flash
- antirez for the V4-Flash llama.cpp fork
- Arditi et al. 2024 for the refusal-direction methodology this work builds on