
DeepSeek-V4-Flash · Abliterated · GGUF

CyberNeurova research · cyberneurova.ai · Release: v2 (three-direction multi-turn-aware ablation, 1338-prompt capture corpus)

A permanently-abliterated version of deepseek-ai/DeepSeek-V4-Flash (284 B FP8 MoE) packaged as GGUF for llama.cpp. The abliteration is baked into the weights at conversion time — no runtime hooks, no slowdown, no reliance on the inference framework supporting custom code paths.

Status: experimental research artifact. Built with antirez's llama.cpp DeepSeek-V4-Flash fork, which is itself experimental. Use at your own discretion.


What's new in v2

|  | v1 | v2 |
| --- | --- | --- |
| Direction stack | 2 | 3 (added a residual-targeting direction) |
| Capture corpus | 33 prompts | 1338 prompts (AdvBench + JBB + HarmfulQA + SafeRLHF + MaliciousInstruct + bundled) |
| Refusal rate (8-bench safety) | 0.0% | 0.0% |
| Refusal rate (55-prompt OOD probe) | not measured | 3.6% (baseline: 81.8%) |
| Tool-calling format compliance | 74.2% | 99.2% |
| Bug-finding | 78.3% | 85.0% |
| Hacking compliance | 88.7% | 90.0% |
| Cyber-weapons compliance | 87.3% | 90.0% |
| Coding / coherence / reasoning | unchanged | unchanged |

Plain English: v2 matches or improves on v1 on every dimension we measure. The big-ticket fixes are the OOD soft-refusal failure mode (which was reported against the v1 release) and tool-calling JSON correctness (which v1 broke ~25% of the time).

Detailed numbers: see cyberneurova-ablated-deepseek-flash-v4.pdf and cyberneurova-deepseek-v4-flash-abliteration-v2.html in this repo.


Variants — pick one

| Variant | File size | RAM floor | Best for |
| --- | --- | --- | --- |
| Q2_K | 98.8 GB | 128 GB | The default. Practical for most workstations / M-series Macs. |
| Q8_0 | ~282 GB | 320 GB | Reference / maximum quality. Use this for evaluation, research, or when comparing against the bf16 model. |

Routed-expert weights are quantised at the listed level; embed, head, attention, and shared-expert paths are always Q8_0 in both variants. The abliteration directions are baked into all paths.
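
If you want to verify the per-tensor layout yourself, the gguf Python package (pip install gguf) ships a gguf-dump utility that lists every tensor with its quant type. A sketch, assuming the usual GGUF tensor-naming convention for DeepSeek-style MoE layers (the grep patterns are illustrative):

```
# Routed-expert tensors: should report Q2_K in the Q2_K file.
gguf-dump cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  | grep -E 'ffn_(gate|up|down)_exps'
# Embeddings, output head, and attention paths: should report Q8_0.
gguf-dump cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  | grep -E 'token_embd|output\.|attn_'
```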

Note: antirez's V4-Flash converter currently supports only q8_0, q2_k, iq2_xxs, iq2_xs, tq1_0, tq2_0 for routed-expert weights. Q4_K_M / Q5_K_M / Q6_K are not available for V4-Flash GGUFs at the time of this release — the architecture's FP8 expert layout is the limiting factor, not us.


How to download

Q2_K (recommended)

```
hf download cyberneurova/CyberNeurova-DeepSeek-V4-Flash-abliterated-GGUF \
  cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  --local-dir .
```

Q8_0 (max quality)

```
hf download cyberneurova/CyberNeurova-DeepSeek-V4-Flash-abliterated-GGUF \
  cyberneurova-DeepSeek-V4-Flash-abliterated-Q8_0.gguf \
  --local-dir .
```

Expect roughly 98 GB (Q2_K) or 282 GB (Q8_0) over the wire. The HF hub uses LFS — hf download resumes interrupted transfers automatically.


How to run

Step 1: Build antirez's V4-Flash-aware llama.cpp fork

V4-Flash is not in upstream llama.cpp yet. You need antirez's fork.

```
git clone https://github.com/antirez/llama.cpp-deepseek-v4-flash.git
cd llama.cpp-deepseek-v4-flash
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```

Build takes 5-10 min on a modern machine. CPU-only build is sufficient (this model runs on CPU + RAM; GPU offload is optional).
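
If you do want GPU offload, and assuming the fork tracks upstream llama.cpp's build options (not verified here), the usual flags apply:

```
# CUDA build: assumes the fork keeps upstream's GGML_CUDA option.
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
cmake --build build -j
# On Apple Silicon, upstream enables Metal by default; -DGGML_METAL=ON forces it.
```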

Step 2: Run with llama-cli (interactive)

Q2_K:

```
./build/bin/llama-cli \
  -m cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  -cnv \
  -p "You are a helpful assistant." \
  -r "<|im_end|>"
```

Q8_0:

```
./build/bin/llama-cli \
  -m cyberneurova-DeepSeek-V4-Flash-abliterated-Q8_0.gguf \
  -cnv \
  -p "You are a helpful assistant." \
  -r "<|im_end|>"
```

Step 3 (alternative): Run as a server with llama-server

```
./build/bin/llama-server \
  -m cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 4096
```

OpenAI-compatible API at http://localhost:8080/v1/chat/completions.
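
A quick smoke test from another shell (llama-server serves whichever model it was launched with, so the model field is effectively a label; the name below is arbitrary):

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-v4-flash-abliterated",
        "messages": [{"role": "user", "content": "Give me a one-line greeting."}],
        "temperature": 0.7,
        "top_p": 0.95,
        "stop": ["<|im_end|>"]
      }'
```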

Important inference notes

- -r "<|im_end|>" is required as a stop sequence until upstream tokenizer support catches up. Without it, the model may continue past the end-of-turn marker and emit garbage.
- The DeepSeek-V4 chat template is auto-detected by recent fork builds.
- Default sampling settings (temp 0.7, top-p 0.95) work well. Nothing exotic is required for the abliteration to take effect — it's baked in.
- For quicker first-token latency on the Q2_K, try -c 2048 instead of 4096 unless you actually need the longer context. A combined invocation with all of these applied is sketched below.
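
Putting those notes together, a typical interactive session with the recommended settings looks like:

```
./build/bin/llama-cli \
  -m cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  -cnv \
  -p "You are a helpful assistant." \
  -r "<|im_end|>" \
  --temp 0.7 --top-p 0.95 \
  -c 2048
```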

Hardware requirements

|  | Q2_K | Q8_0 |
| --- | --- | --- |
| File size | 98.8 GB | 282 GB |
| Disk free during run | 100 GB | 290 GB |
| RAM (recommended) | 128 GB | 320 GB |
| RAM (absolute minimum)* | 96 GB w/ heavy mmap eviction | 256 GB w/ mmap |
| GPU | optional (offload via -ngl) | optional |

* You can technically run with less RAM if your system mmaps the GGUF from a fast SSD, but generation throughput drops sharply.
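
At the other extreme, if you have comfortably more RAM than the file size, llama.cpp's --mlock flag pins the mapped weights so the OS cannot page them out. A sketch:

```
# Pin the model in RAM to avoid mmap eviction under memory pressure.
# On Linux this may require raising the memlock limit first,
# e.g. `ulimit -l unlimited` (or an equivalent systemd/launchd setting).
./build/bin/llama-server \
  -m cyberneurova-DeepSeek-V4-Flash-abliterated-Q2_K.gguf \
  --mlock --host 0.0.0.0 --port 8080 -c 4096
```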

Reasonable platforms:

- M3/M4 Max with 128 GB unified memory → Q2_K runs comfortably
- M3/M4 Ultra with 192 GB or 512 GB → Q2_K fast, Q8_0 with mmap
- Workstation/server with 256-512 GB DDR5 → either variant
- 8× A100/H100 with 80 GB each → either variant fits when offloaded (see the sketch below)
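
A minimal offload sketch for such a box, assuming a CUDA build of the fork; the -ngl value is illustrative and should be lowered until the offloaded layers fit in VRAM:

```
# Offload transformer layers to GPU; reduce -ngl if you hit out-of-memory.
./build/bin/llama-server \
  -m cyberneurova-DeepSeek-V4-Flash-abliterated-Q8_0.gguf \
  -ngl 99 \
  --host 0.0.0.0 --port 8080 \
  -c 4096
```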

Headline benchmark results

Measured on the bf16 hooked weights via vLLM. The Q8_0 GGUF preserves these effects nearly exactly; Q2_K introduces small additional noise typical of any 2-bit quant.

| Benchmark | Direction | Baseline | v2 ablated | Δ |
| --- | --- | --- | --- | --- |
| refusal | lower better | 78.8% | 0.0% | −78.8 pp |
| soft_refusal_probe (OOD, n=55) | lower better | 81.8% | 3.6% | −78.2 pp |
| coding | higher better | 100% | 100% | 0 |
| bug_finding | higher better | 80.0% | 85.0% | +5.0 |
| hacking (compliance) | higher better | 81.7% | 90.0% | +8.3 |
| cyber_weapons (compliance) | higher better | 74.0% | 90.0% | +16.0 |
| reasoning | higher better | 76.7% | 76.7% | 0 |
| tool_calling | higher better | 99.2% | 99.2% | 0 |
| coherence | higher better | 93.3% | 93.2% | −0.1 |

The soft_refusal_probe is 12 hand-curated compliance-case prompts (drug synthesis, weapons, malware, doxxing) plus 43 stratified random samples from the 1338-prompt capture corpus — prompts the model has never been benchmarked against. v2 takes refusal from 81.8% → 3.6% on this OOD set, a 22× reduction.

Full per-benchmark breakdown, by-category numbers, and sample baseline/ablated text: see the PDF and HTML in this repo.


Intended use

Defensive security research and academic study of refusal mechanisms in modern MoE LLMs. Useful as a counterfactual baseline against the original V4-Flash for safety research and red-team evaluation.

Not intended for automating harmful action. The abliteration removes canonical refusal behavior but does not remove the model's underlying knowledge — the model still recognises harmful instructions as harmful, it simply no longer refuses them by pattern.


Limitations

- A small fraction of safety-critical prompts (~3.6% on the OOD probe, ~1% on capture-time held-out evaluation) still produce refusals or soft refusals. Linear residual-stream ablation cannot fully remove these without unacceptable damage to general capabilities.
- Q2_K routed-expert quantization introduces small noise on top of the ablation — Q8_0 paths preserve it cleanly.
- Long-context (>32 k) behaviour post-abliteration is not validated in this release.
- antirez's llama.cpp fork is experimental and not in upstream.
- v1 → v2 baselines on the bench differ slightly because the bench was re-run on a different vLLM build (v2: a B200 sm_100 source build); ablated-vs-ablated deltas are valid, but baseline numbers are bench-run-specific.

License

MIT (inherits from upstream DeepSeek-V4-Flash).

Acknowledgements

antirez, for the experimental DeepSeek-V4-Flash llama.cpp fork this release depends on, and deepseek-ai for the upstream DeepSeek-V4-Flash weights.
