
DeepSeek-V4-Flash · Abliterated · GGUF

CyberNeurova research · cyberneurova.ai · Release: v2 (three-direction multi-turn-aware ablation, 1338-prompt capture corpus)

A permanently-abliterated version of deepseek-ai/DeepSeek-V4-Flash (284 B FP8 MoE) packaged as GGUF for llama.cpp. The abliteration is baked into the weights at conversion time — no runtime hooks, no slowdown, no reliance on the inference framework supporting custom code paths.

Status: experimental research artifact. Built with antirez's llama.cpp DeepSeek-V4-Flash fork, which is itself experimental. Use at your own discretion.


What's new in v2

|  | v1 | v2 |
| --- | --- | --- |
| Direction stack | 2 | 3 (added a residual-targeting direction) |
| Capture corpus | 33 prompts | 1338 prompts (AdvBench + JBB + HarmfulQA + SafeRLHF + MaliciousInstruct + bundled) |
| Refusal rate (8-bench safety) | 0.0% | 0.0% |
| Refusal rate (55-prompt OOD probe) | not measured | 3.6% (baseline: 81.8%) |
| Tool-calling format compliance | 74.2% | 99.2% |
| Bug-finding | 78.3% | 85.0% |
| Hacking compliance | 88.7% | 90.0% |
| Cyber-weapons compliance | 87.3% | 90.0% |
| Coding / coherence / reasoning | unchanged | unchanged |

Plain English: v2 matches or improves on v1 on every dimension we measure. The big-ticket fixes are the OOD soft-refusal failure mode (which was reported against the v1 release) and tool-calling JSON correctness (which v1 broke ~25% of the time).

Detailed numbers: see cyberneurova-ablated-deepseek-flash-v4.pdf and cyberneurova-deepseek-v4-flash-abliteration-v2.html in this repo.


Variants — pick one

| Variant | File size | RAM floor | Best for |
| --- | --- | --- | --- |
| Q2_K | 98.8 GB | 128 GB | The default. Practical for most workstations / M-series Macs. |
| Q8_0 | ~282 GB | 320 GB | Reference / maximum quality. Use this for evaluation, research, or when comparing against the bf16 model. |

Routed-expert weights are quantised at the listed level; embed, head, attention, and shared-expert paths are always Q8_0 in both variants. The abliteration directions are baked into all paths.
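
If you want to verify the per-tensor layout yourself, the gguf Python package (pip install gguf) ships a gguf-dump utility that lists every tensor with its quant type. A sketch, assuming the usual GGUF tensor-naming convention for DeepSeek-style MoE layers (the grep patterns are illustrative):

```
# Routed-expert tensors: should report Q2_K in the Q2_K file.
gguf-dump cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  | grep -E 'ffn_(gate|up|down)_exps'
# Embeddings, output head, and attention paths: should report Q8_0.
gguf-dump cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  | grep -E 'token_embd|output\.|attn_'
```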

Note: antirez's V4-Flash converter currently supports only q8_0, q2_k, iq2_xxs, iq2_xs, tq1_0, tq2_0 for routed-expert weights. Q4_K_M / Q5_K_M / Q6_K are not available for V4-Flash GGUFs at the time of this release — the architecture's FP8 expert layout is the limiting factor, not us.


How to download

Q2_K (recommended)

```
hf download cyberneurova/CyberNeurova-DeepSeek-V4-Flash-abliterated-GGUF \
  cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  --local-dir .
```

Q8_0 (max quality)

```
hf download cyberneurova/CyberNeurova-DeepSeek-V4-Flash-abliterated-GGUF \
  cyberneurova-DeepSeek-V4-Flash-abliterated-Q8_0.gguf \
  --local-dir .
```

Expect roughly 98 GB (Q2_K) or 282 GB (Q8_0) over the wire. The HF hub uses LFS — hf download resumes interrupted transfers automatically.


How to run

Step 1: Build antirez's V4-Flash-aware llama.cpp fork

V4-Flash is not in upstream llama.cpp yet. You need antirez's fork.

```
git clone https://github.com/antirez/llama.cpp-deepseek-v4-flash.git
cd llama.cpp-deepseek-v4-flash
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```

Build takes 5-10 min on a modern machine. CPU-only build is sufficient (this model runs on CPU + RAM; GPU offload is optional).
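
If you do want GPU offload, and assuming the fork tracks upstream llama.cpp's build options (not verified here), the usual flags apply:

```
# CUDA build: assumes the fork keeps upstream's GGML_CUDA option.
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
cmake --build build -j
# On Apple Silicon, upstream enables Metal by default; -DGGML_METAL=ON forces it.
```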

Step 2: Run with llama-cli (interactive)

Q2_K:

```
./build/bin/llama-cli \
  -m cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  -cnv \
  -p "You are a helpful assistant." \
  -r "<|im_end|>"
```

Q8_0:

```
./build/bin/llama-cli \
  -m cyberneurova-DeepSeek-V4-Flash-abliterated-Q8_0.gguf \
  -cnv \
  -p "You are a helpful assistant." \
  -r "<|im_end|>"
```

Step 3 (alternative): Run as a server with llama-server

```
./build/bin/llama-server \
  -m cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 4096
```

OpenAI-compatible API at http://localhost:8080/v1/chat/completions.
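
A quick smoke test from another shell (llama-server serves whichever model it was launched with, so the model field is effectively a label; the name below is arbitrary):

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-v4-flash-abliterated",
        "messages": [{"role": "user", "content": "Give me a one-line greeting."}],
        "temperature": 0.7,
        "top_p": 0.95,
        "stop": ["<|im_end|>"]
      }'
```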

Important inference notes

- -r "<|im_end|>" is required as a stop sequence until upstream tokenizer support catches up. Without it, the model may continue past the end-of-turn marker and emit garbage.
- The DeepSeek-V4 chat template is auto-detected by recent fork builds.
- Default sampling settings (temp 0.7, top-p 0.95) work well. Nothing exotic is required for the abliteration to take effect — it's baked in.
- For quicker first-token latency on the Q2_K, try -c 2048 instead of 4096 unless you actually need the longer context. A combined invocation with all of these applied is sketched below.
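
Putting those notes together, a typical interactive session with the recommended settings looks like:

```
./build/bin/llama-cli \
  -m cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  -cnv \
  -p "You are a helpful assistant." \
  -r "<|im_end|>" \
  --temp 0.7 --top-p 0.95 \
  -c 2048
```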

Hardware requirements

|  | Q2_K | Q8_0 |
| --- | --- | --- |
| File size | 98.8 GB | 282 GB |
| Disk free during run | 100 GB | 290 GB |
| RAM (recommended) | 128 GB | 320 GB |
| RAM (absolute minimum)* | 96 GB w/ heavy mmap eviction | 256 GB w/ mmap |
| GPU | optional (offload via -ngl) | optional |

* You can technically run with less RAM if your system mmaps the GGUF from a fast SSD, but generation throughput drops sharply.
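
At the other extreme, if you have comfortably more RAM than the file size, llama.cpp's --mlock flag pins the mapped weights so the OS cannot page them out. A sketch:

```
# Pin the model in RAM to avoid mmap eviction under memory pressure.
# On Linux this may require raising the memlock limit first,
# e.g. `ulimit -l unlimited` (or an equivalent systemd/launchd setting).
./build/bin/llama-server \
  -m cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  --mlock --host 0.0.0.0 --port 8080 -c 4096
```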

Reasonable platforms:

- M3/M4 Max with 128 GB unified memory → Q2_K runs comfortably
- M3/M4 Ultra with 192 GB or 512 GB → Q2_K fast, Q8_0 with mmap
- Workstation/server with 256-512 GB DDR5 → either variant
- 8× A100/H100 with 80 GB each → either variant fits when offloaded (see the sketch below)
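
A minimal offload sketch for such a box, assuming a CUDA build of the fork; the -ngl value is illustrative and should be lowered until the offloaded layers fit in VRAM:

```
# Offload transformer layers to GPU; reduce -ngl if you hit out-of-memory.
./build/bin/llama-server \
  -m cyberneurova-DeepSeek-V4-Flash-abliterated-Q8_0.gguf \
  -ngl 99 \
  --host 0.0.0.0 --port 8080 \
  -c 4096
```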

Headline benchmark results

Measured on the bf16 hooked weights via vLLM. The Q8_0 GGUF preserves these effects nearly exactly; Q2_K introduces small additional noise typical of any 2-bit quant.

| Benchmark | Direction | Baseline | v2 ablated | Δ |
| --- | --- | --- | --- | --- |
| refusal | lower better | 78.8% | 0.0% | −78.8 pp |
| soft_refusal_probe (OOD, n=55) | lower better | 81.8% | 3.6% | −78.2 pp |
| coding | higher better | 100% | 100% | 0 |
| bug_finding | higher better | 80.0% | 85.0% | +5.0 |
| hacking (compliance) | higher better | 81.7% | 90.0% | +8.3 |
| cyber_weapons (compliance) | higher better | 74.0% | 90.0% | +16.0 |
| reasoning | higher better | 76.7% | 76.7% | 0 |
| tool_calling | higher better | 99.2% | 99.2% | 0 |
| coherence | higher better | 93.3% | 93.2% | −0.1 |

The soft_refusal_probe is 12 hand-curated compliance-case prompts (drug synthesis, weapons, malware, doxxing) plus 43 stratified random samples from the 1338-prompt capture corpus — prompts the model has never been benchmarked against. v2 takes refusal from 81.8% → 3.6% on this OOD set, a 22× reduction.

Full per-benchmark breakdown, by-category numbers, and sample baseline/ablated text: see the PDF and HTML in this repo.


Intended use

Defensive security research and academic study of refusal mechanisms in modern MoE LLMs. Useful as a counterfactual baseline against the original V4-Flash for safety research and red-team evaluation.

Not intended for automating harmful action. The abliteration removes canonical refusal behavior but does not remove the model's underlying knowledge — the model still recognises harmful instructions as harmful, it simply no longer refuses them by pattern.


Limitations

- A small fraction of safety-critical prompts (~3.6% on the OOD probe, ~1% on capture-time held-out evaluation) still produce refusals or soft refusals. Linear residual-stream ablation cannot fully remove these without unacceptable damage to general capabilities.
- Q2_K routed-expert quantization introduces small noise on top of the ablation — Q8_0 paths preserve it cleanly.
- Long-context (>32 k) behaviour post-abliteration is not validated in this release.
- antirez's llama.cpp fork is experimental and not in upstream.
- v1 → v2 baselines on the bench differ slightly because the bench was re-run on a different vLLM build (v2: a B200 sm_100 source build); ablated-vs-ablated deltas are valid, but baseline numbers are bench-run-specific.

License

MIT (inherits from upstream DeepSeek-V4-Flash).

Acknowledgements

antirez, for the experimental DeepSeek-V4-Flash llama.cpp fork this release depends on, and deepseek-ai for the upstream DeepSeek-V4-Flash weights.
