Instructions to use persadian/DeepSeek-V4-Flash-IQ1_S-XL with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use persadian/DeepSeek-V4-Flash-IQ1_S-XL with llama-cpp-python:

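# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Download the GGUF file from the Hub (cached after the first call) and load it.
llm = Llama.from_pretrained(
    repo_id="persadian/DeepSeek-V4-Flash-IQ1_S-XL",
    filename="DeepSeek-V4-Flash-IQ1_S-XL.gguf",
)

# Run one chat turn and print the assistant's reply.
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)
print(response["choices"][0]["message"]["content"])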

Local Apps

llama.cpp

How to use persadian/DeepSeek-V4-Flash-IQ1_S-XL with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf persadian/DeepSeek-V4-Flash-IQ1_S-XL:IQ1_S
# Run inference directly in the terminal:
llama-cli -hf persadian/DeepSeek-V4-Flash-IQ1_S-XL:IQ1_S

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf persadian/DeepSeek-V4-Flash-IQ1_S-XL:IQ1_S
# Run inference directly in the terminal:
llama-cli -hf persadian/DeepSeek-V4-Flash-IQ1_S-XL:IQ1_S

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf persadian/DeepSeek-V4-Flash-IQ1_S-XL:IQ1_S
# Run inference directly in the terminal:
./llama-cli -hf persadian/DeepSeek-V4-Flash-IQ1_S-XL:IQ1_S

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf persadian/DeepSeek-V4-Flash-IQ1_S-XL:IQ1_S
# Run inference directly in the terminal:
./build/bin/llama-cli -hf persadian/DeepSeek-V4-Flash-IQ1_S-XL:IQ1_S

Use Docker

docker model run hf.co/persadian/DeepSeek-V4-Flash-IQ1_S-XL:IQ1_S

vLLM

How to use persadian/DeepSeek-V4-Flash-IQ1_S-XL with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "persadian/DeepSeek-V4-Flash-IQ1_S-XL"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "persadian/DeepSeek-V4-Flash-IQ1_S-XL",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/persadian/DeepSeek-V4-Flash-IQ1_S-XL:IQ1_S

Ollama
How to use persadian/DeepSeek-V4-Flash-IQ1_S-XL with Ollama:
```
ollama run hf.co/persadian/DeepSeek-V4-Flash-IQ1_S-XL:IQ1_S
```

Unsloth Studio

How to use persadian/DeepSeek-V4-Flash-IQ1_S-XL with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for persadian/DeepSeek-V4-Flash-IQ1_S-XL to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for persadian/DeepSeek-V4-Flash-IQ1_S-XL to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for persadian/DeepSeek-V4-Flash-IQ1_S-XL to start chatting

Pi

How to use persadian/DeepSeek-V4-Flash-IQ1_S-XL with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf persadian/DeepSeek-V4-Flash-IQ1_S-XL:IQ1_S

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "persadian/DeepSeek-V4-Flash-IQ1_S-XL:IQ1_S"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use persadian/DeepSeek-V4-Flash-IQ1_S-XL with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf persadian/DeepSeek-V4-Flash-IQ1_S-XL:IQ1_S

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default persadian/DeepSeek-V4-Flash-IQ1_S-XL:IQ1_S

Run Hermes

hermes

Docker Model Runner
How to use persadian/DeepSeek-V4-Flash-IQ1_S-XL with Docker Model Runner:
```
docker model run hf.co/persadian/DeepSeek-V4-Flash-IQ1_S-XL:IQ1_S
```

Lemonade

How to use persadian/DeepSeek-V4-Flash-IQ1_S-XL with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull persadian/DeepSeek-V4-Flash-IQ1_S-XL:IQ1_S

Run and chat with the model

lemonade run user.DeepSeek-V4-Flash-IQ1_S-XL-IQ1_S

List all available models

lemonade list

DFQS SPECIFICATION v1.0

DeepSeek-V4-Flash-IQ1_S-XL (Reference Implementation)

284B MoE · 13B Active · 61.6GB GGUF · CPU-Feasible Inference

Author: Darshani Persadh (@persadian)
Hugging Face Handle: @persadian
GitHub: arishma108
DOI: 10.57967/hf/8853
Publication Date: May 19, 2026

ARTIFACT INTEGRITY

This section provides cryptographic verification of the DFQS-IQ1_S-XL artifact for reproducibility and integrity validation.

File: DeepSeek-V4-Flash-IQ1_S-XL.gguf (61.6GB)
SHA-256: b049d1eb34c068f19ab007b33c22a7d758b578bf2b10d9276e79654f85d35047
Timestamp: 2026-05-19 14:32:17 UTC

This hash verifies:

file integrity
deterministic reconstruction of the merged GGUF artifact
consistency of DFQS-IQ1_S-XL deployment packaging

This block is intended for reproducibility validation across DFQS-compatible environments.
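
A minimal sketch of how the published digest could be checked locally. It uses only Python's standard `hashlib`; the function names `sha256_of_file` and `verify_artifact` are illustrative and not part of DFQS.

```python
import hashlib

# Digest published in the ARTIFACT INTEGRITY block.
EXPECTED_SHA256 = "b049d1eb34c068f19ab007b33c22a7d758b578bf2b10d9276e79654f85d35047"


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so the 61.6GB artifact never has to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected=EXPECTED_SHA256):
    """Return True when the local file matches the expected digest."""
    return sha256_of_file(path) == expected.lower()
```

Calling `verify_artifact("DeepSeek-V4-Flash-IQ1_S-XL.gguf")` after download should return `True` for an unmodified copy. A mismatch indicates a truncated or altered file.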

1. SCOPE

This specification defines the DFQS (DeepSeek Flash Quantization Standard) for ultra-low-bit Mixture-of-Experts (MoE) deployment systems.

It defines:

deployment constraints
behavioral expectations
evaluation interface
reference implementation structure

This specification does NOT define:

model training procedures
fine-tuning workflows
upstream architecture modifications

2. TERMINOLOGY

Term	Definition
DFQS	DeepSeek Flash Quantization Standard
IQ1_S-XL	Ultra-low-bit reference deployment class
MoE	Mixture-of-Experts architecture
GGUF	Unified inference format
Routing	Expert selection mechanism

3. NORMATIVE REQUIREMENTS

SHALL

DFQS-IQ1_S-XL SHALL support single-file GGUF execution
Models SHALL operate in CPU-constrained environments
Routing SHALL remain deterministic under standard inference loads
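
The specification does not say how routing determinism is measured. One possible proxy, sketched below under that assumption, is to confirm that repeated runs of the same prompt produce identical output. `generate` is a hypothetical wrapper around the runtime and is expected to use greedy decoding (temperature 0) with a fixed seed.

```python
from typing import Callable, List


def is_deterministic(generate: Callable[[str], str], prompts: List[str], runs: int = 3) -> bool:
    """Return True if every prompt yields identical output across repeated runs.

    Identical output is only a proxy: it does not inspect expert routing directly.
    """
    for prompt in prompts:
        outputs = {generate(prompt) for _ in range(runs)}
        if len(outputs) != 1:
            return False
    return True
```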

SHOULD

Implementations SHOULD support llama.cpp runtime compatibility
Evaluation SHOULD include long-context degradation analysis

MAY

GPU acceleration MAY be used for optimization
Extended context beyond 64K MAY be supported

4. REFERENCE IMPLEMENTATION (IQ1_S-XL)

DFQS-IQ1_S-XL defines a constrained-memory MoE deployment class designed for:

deterministic GGUF execution
CPU-feasible inference
ultra-low-bit routing stability
single-file deployment architecture

5. SPEC SNAPSHOT

Property	Value
Model	DeepSeek-V4-Flash-IQ1_S-XL
Architecture	Mixture-of-Experts (MoE)
Active Params	13B
Total Params	284B
Size	61.6GB
Format	GGUF (single-file)
Runtime	llama.cpp
DFQS Class	IQ1_S-XL
Deployment Tier	Reference Ultra-Low-Bit

6. BEHAVIORAL CAPABILITIES (REFERENCE PROFILE)

Task	Support Level
Code Generation	Primary
Instruction Following	Full
Long-Context Reasoning (1M tokens)	Full
Conversational AI	Full
Text Generation	Full
Translation	Limited (English primary)

7. ONE-LINE THESIS

DFQS-IQ1_S-XL defines an ultra-low-bit operational deployment class for large-scale MoE inference under constrained memory environments.

8. DFQS POSITIONING LAYER

The following hierarchy defines DFQS-IQ1_S-XL within the broader inference compression spectrum:

FP16 / FP8 (Frontier Models)
→ Q4–Q6 GGUF (Production Inference)
→ IQ2 (Experimental Compression)
→ DFQS-IQ1_S-XL (Reference Implementation)

9. WHY 61.6GB MATTERS

Traditional DeepSeek-V4-Flash deployments typically operate within:

120GB–300GB GGUF ranges
GPU-first inference systems

DFQS-IQ1_S-XL establishes:

sub-70GB operational envelope
CPU-accessible MoE inference
constrained-memory deployment feasibility
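
As a worked figure, the artifact size and total parameter count stated in this card imply an average storage cost per weight. The sketch assumes "GB" means 10**9 bytes and that all 284B parameters are stored in the file. The result is an average over the whole file, including metadata and any tensors kept at higher precision.

```python
# Figures taken from this card: 61.6GB artifact, 284B total parameters.
# Assumption: "GB" means 10**9 bytes; with GiB the result would be about 7% higher.
FILE_SIZE_BYTES = 61.6e9
TOTAL_PARAMS = 284e9

bits_per_weight = FILE_SIZE_BYTES * 8 / TOTAL_PARAMS
print(f"{bits_per_weight:.2f} bits per weight")  # → 1.74 bits per weight
```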

10. BEHAVIORAL PROFILE

DFQS-IQ1_S-XL prioritizes operational stability under compression over benchmark maximization.

Property	Behavior
Routing Consistency	Stable
Deterministic Execution	Maintained
Long-Context Stability	Gradual degradation
CPU Feasibility	Supported
Expert Coherence	Preserved

LIMITATIONS (BEHAVIORAL CONSTRAINTS)

Performance degrades under long-context saturation
Routing variance increases under extreme token pressure
Memory constraints may trigger latency spikes or truncation behavior

Within the defined compression and memory constraints, inference stability is maintained.

11. EVALUATION INTERFACE

REQUIRED METRICS

All DFQS implementations SHALL report:

reasoning_score: float
code_score: float
context_stability_curve: list[float]
cpu_tokens_per_sec: float
failure_boundary_tokens: int
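
The required metrics can be carried in a small typed container. The sketch below takes its field names and types from the list above. The class name `DFQSReport` and the sanity checks in `validate` are assumptions, since the specification does not define value ranges or a serialization format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class DFQSReport:
    """Container for the five required DFQS metrics (field names from the spec)."""
    reasoning_score: float
    code_score: float
    context_stability_curve: List[float]
    cpu_tokens_per_sec: float
    failure_boundary_tokens: int

    def validate(self) -> None:
        # Assumed sanity checks; the spec does not define value ranges.
        if not self.context_stability_curve:
            raise ValueError("context_stability_curve must not be empty")
        if self.cpu_tokens_per_sec <= 0:
            raise ValueError("cpu_tokens_per_sec must be positive")
        if self.failure_boundary_tokens < 0:
            raise ValueError("failure_boundary_tokens must be non-negative")

    def to_json(self) -> str:
        """Validate, then serialize the report for cross-model comparison."""
        self.validate()
        return json.dumps(asdict(self), indent=2)
```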

EVALUATION CONDITIONS

CPU-only baseline unless specified
llama.cpp runtime
standardized prompt sets

MEASUREMENT CONVENTION

All metrics MUST be reported under identical prompt and runtime conditions for cross-model comparability.

12. IMPLEMENTATION NOTES (NON-NORMATIVE)

The DFQS-IQ1_S-XL artifact uses a sequential shard merge process:

Sequential shard ingestion
Chunked binary concatenation
GGUF header validation
Post-validation cleanup

This describes implementation behavior and does not define DFQS requirements.
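
The four steps can be sketched as follows. This is not the author's merge script. It assumes the shards are raw byte-range pieces of one GGUF file, so plain concatenation restores the original. Shards produced by llama.cpp's `llama-gguf-split` each carry their own header and need `llama-gguf-split --merge` instead. The function name `merge_shards` and its parameters are illustrative.

```python
import os
from pathlib import Path
from typing import Iterable

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def merge_shards(shards: Iterable[Path], output: Path,
                 chunk_size: int = 64 << 20, cleanup: bool = False) -> None:
    """Concatenate raw byte-range shards into one file, then validate the header."""
    shards = list(shards)
    with open(output, "wb") as dst:
        for shard in shards:  # sequential shard ingestion
            with open(shard, "rb") as src:
                while chunk := src.read(chunk_size):  # chunked binary concatenation
                    dst.write(chunk)
    with open(output, "rb") as merged:  # GGUF header validation (magic bytes only)
        magic = merged.read(4)
    if magic != GGUF_MAGIC:
        os.remove(output)
        raise ValueError(f"{output} does not start with the GGUF magic bytes")
    if cleanup:  # post-validation cleanup
        for shard in shards:
            os.remove(shard)
```

Only one chunk is held in memory at a time, so peak RAM use stays near `chunk_size` regardless of shard size.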

Efficiency Note

This approach reduces intermediate storage requirements compared to full shard reconstruction workflows.

13. DEPLOYMENT

llama.cpp

# Using the merged single file
llama-server -hf persadian/DeepSeek-V4-Flash-IQ1_S-XL

# Or download into the current directory and run locally
huggingface-cli download persadian/DeepSeek-V4-Flash-IQ1_S-XL DeepSeek-V4-Flash-IQ1_S-XL.gguf --local-dir .
./llama-cli -m DeepSeek-V4-Flash-IQ1_S-XL.gguf -p "Your prompt"
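
Once `llama-server` is running, it can be queried over its OpenAI-compatible chat completions route. The sketch below uses only the standard library and assumes the server listens at `http://localhost:8080/v1`, the base URL that appears in the Pi and Hermes configurations earlier on this page. The helper names are illustrative.

```python
import json
import urllib.request

# Assumption: llama-server is reachable at its default local address.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, base_url=BASE_URL):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, base_url=BASE_URL, timeout=600):
    """Send the prompt to the running server and return the reply text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The generous default timeout is a guess that CPU-only generation may be slow; adjust it to the hardware in use.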

Python

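from llama_cpp import Llama

# Requires: pip install llama-cpp-python huggingface-hub
llm = Llama.from_pretrained(
    repo_id="persadian/DeepSeek-V4-Flash-IQ1_S-XL",
    filename="DeepSeek-V4-Flash-IQ1_S-XL.gguf",
)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Your prompt"}]
)
print(response["choices"][0]["message"]["content"])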

Ollama

ollama run hf.co/persadian/DeepSeek-V4-Flash-IQ1_S-XL

Docker

docker model run hf.co/persadian/DeepSeek-V4-Flash-IQ1_S-XL

14. HARDWARE ENVELOPE

Component	Minimum	Recommended
RAM	80GB	128GB
GPU VRAM	22GB	24GB+
Storage	60GB	150GB

Runtime memory includes KV cache overhead and context expansion.
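
A preflight check against the minimums in the table might look like the sketch below. The thresholds are copied from the table as stated. GPU VRAM is left out as an assumption, since section 3 lists GPU acceleration under MAY. Installed RAM is passed in by the caller because the standard library has no portable way to read it.

```python
import shutil

# Minimums copied from the HARDWARE ENVELOPE table.
MIN_RAM_GB = 80
MIN_STORAGE_GB = 60


def envelope_shortfalls(ram_gb, free_storage_gb):
    """Return a list of problems; an empty list means both minimums are met."""
    problems = []
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM {ram_gb:.0f}GB is below the {MIN_RAM_GB}GB minimum")
    if free_storage_gb < MIN_STORAGE_GB:
        problems.append(
            f"free storage {free_storage_gb:.0f}GB is below the {MIN_STORAGE_GB}GB minimum"
        )
    return problems


def free_storage_gb(path="."):
    """Free space on the volume holding `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9
```

For example, `envelope_shortfalls(ram_gb=128, free_storage_gb=free_storage_gb("."))` reports whether the download directory has enough room.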

15. VALIDATION STATUS

GGUF integrity: validated at load-time
Single-file structure: confirmed
llama.cpp compatibility: tested
CPU inference: operational

16. SYSTEM ADOPTION ANALYSIS

The DFQS-IQ1_S-XL reference implementation has demonstrated substantial direct deployment adoption relative to the upstream shard-distribution workflow.

This adoption pattern suggests increasing preference toward:

single-file deployment architectures
constrained-memory inference workflows
deployment-ready GGUF artifacts
deterministic reconstruction-free execution paths

The separation between shard-based distribution and DFQS deployment implementation reflects a layered inference infrastructure model:

Layer	Function
Shard Repository	Artifact distribution and reconstruction workflows
DFQS-IQ1_S-XL	Reference deployment implementation
DFQS Specification	Deployment standardization layer
DFQS Evaluation Suite	Runtime validation framework

This repository serves as the canonical DFQS reference deployment implementation for DeepSeek-V4-Flash under constrained-memory operational environments.

17. CITATION

@misc{persadian2026dfqs_iq1sxl,
  author = {Persadh, Darshani},
  title = {DFQS-IQ1_S-XL: Ultra-Low-Bit MoE Deployment Standard},
  year = {2026},
  publisher = {Hugging Face},
  version = {IQ1_S-XL},
  doi = {10.57967/hf/8853},
  url = {https://doi.org/10.57967/hf/8853}
}

APA

Persadh, D.R. (2026). DFQS-IQ1_S-XL: Ultra-Low-Bit MoE Deployment Standard (IQ1_S-XL) [persadian/DeepSeek-V4-Flash-IQ1_S-XL.gguf]. Hugging Face. https://doi.org/10.57967/hf/8853

18. DFQS DEPLOYMENT EFFICIENCY CONTEXT

This model’s compression architecture reduces inference resource requirements relative to standard MoE deployments.

Carbon offset and reduced compute footprint are secondary outcomes of constrained-memory design.

Total CO2 offset: 20 kg · Offset Project Code: 9184338

This model is part of sustainable AI practices.

ENVIRONMENTAL IMPACT

This model's development and hosting have been carbon-offset through reforestation initiatives.

19. FINAL STATEMENT

This repository defines a DFQS-compliant deployment boundary for constrained Mixture-of-Experts inference systems.

Downloads last month: 1,552

Format: GGUF
Model size: 229B params
Architecture: deepseek4
Hardware compatibility: 1-bit

Model tree for persadian/DeepSeek-V4-Flash-IQ1_S-XL

Base model: deepseek-ai/DeepSeek-V4-Flash
Quantized (64): this model (persadian/DeepSeek-V4-Flash-IQ1_S-XL)
