DFQS SPECIFICATION v1.0
DeepSeek-V4-Flash-IQ1_S-XL (Reference Implementation)
284B MoE · 13B Active · 61.6GB GGUF · CPU-Feasible Inference
Author: Darshani Persadh (@persadian)
Hugging Face Handle: @persadian
GitHub: arishma108
DOI: 10.57967/hf/8853
Publication Date: May 19, 2026
ARTIFACT INTEGRITY
This section provides cryptographic verification of the DFQS-IQ1_S-XL artifact for reproducibility and integrity validation.
File: DeepSeek-V4-Flash-IQ1_S-XL.gguf (61.6GB)
SHA-256: b049d1eb34c068f19ab007b33c22a7d758b578bf2b10d9276e79654f85d35047
Timestamp: 2026-05-19 14:32:17 UTC
This hash verifies:
- file integrity
- deterministic reconstruction of the merged GGUF artifact
- consistency of DFQS-IQ1_S-XL deployment packaging
This block is intended for reproducibility validation across DFQS-compatible environments.
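A minimal sketch of how the published digest could be checked locally, assuming the file has already been downloaded. The function names `sha256_of_file` and `verify_artifact` are illustrative and not part of any DFQS tooling; only the digest itself comes from this card.

```python
import hashlib

# Published digest, copied from the ARTIFACT INTEGRITY block of this card.
EXPECTED_SHA256 = "b049d1eb34c068f19ab007b33c22a7d758b578bf2b10d9276e79654f85d35047"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected=EXPECTED_SHA256):
    """Return True when the file on disk matches the expected digest."""
    return sha256_of_file(path) == expected.lower()
```

Calling `verify_artifact("DeepSeek-V4-Flash-IQ1_S-XL.gguf")` reads the file once in 1 MiB chunks, so memory use stays flat regardless of file size. A match confirms the local file is byte-identical to the published artifact; it says nothing about model quality.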
1. SCOPE
This specification defines the DFQS (DeepSeek Flash Quantization Standard) for ultra-low-bit Mixture-of-Experts (MoE) deployment systems.
It defines:
- deployment constraints
- behavioral expectations
- evaluation interface
- reference implementation structure
This specification does NOT define:
- model training procedures
- fine-tuning workflows
- upstream architecture modifications
2. TERMINOLOGY
| Term | Definition |
|---|---|
| DFQS | DeepSeek Flash Quantization Standard |
| IQ1_S-XL | Ultra-low-bit reference deployment class |
| MoE | Mixture-of-Experts architecture |
| GGUF | Binary model file format used by llama.cpp and the GGML ecosystem |
| Routing | Expert selection mechanism |
3. NORMATIVE REQUIREMENTS
SHALL
- DFQS-IQ1_S-XL SHALL support single-file GGUF execution
- Models SHALL operate in CPU-constrained environments
- Routing SHALL remain deterministic under standard inference loads
SHOULD
- Implementations SHOULD support llama.cpp runtime compatibility
- Evaluation SHOULD include long-context degradation analysis
MAY
- GPU acceleration MAY be used for optimization
- Extended context beyond 64K MAY be supported
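The determinism requirement can be spot-checked from the outside by repeating a prompt and comparing outputs. The sketch below is an assumption about how such a check might look, not a DFQS-defined test: `is_deterministic` is a hypothetical helper, `generate` is any caller-supplied function that wraps the runtime with fixed sampling settings (for example greedy decoding and a fixed seed), and identical text output is used only as a proxy for stable expert routing.

```python
from typing import Callable


def is_deterministic(generate: Callable[[str], str], prompt: str, runs: int = 3) -> bool:
    """Return True if `generate` yields the same text on every repeated call.

    `generate` is supplied by the caller and is assumed to use fixed sampling
    settings; this helper only compares the resulting strings.
    """
    if runs < 2:
        raise ValueError("at least two runs are needed to compare outputs")
    outputs = {generate(prompt) for _ in range(runs)}
    return len(outputs) == 1
```

A result of `False` shows that something in the stack varies between runs, but it does not by itself identify routing as the cause.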
4. REFERENCE IMPLEMENTATION (IQ1_S-XL)
DFQS-IQ1_S-XL defines a constrained-memory MoE deployment class designed for:
- deterministic GGUF execution
- CPU-feasible inference
- ultra-low-bit routing stability
- single-file deployment architecture
5. SPEC SNAPSHOT
| Property | Value |
|---|---|
| Model | DeepSeek-V4-Flash-IQ1_S-XL |
| Architecture | Mixture-of-Experts (MoE) |
| Active Params | 13B |
| Total Params | 284B |
| Size | 61.6GB |
| Format | GGUF (single-file) |
| Runtime | llama.cpp |
| DFQS Class | IQ1_S-XL |
| Deployment Tier | Reference Ultra-Low-Bit |
6. BEHAVIORAL CAPABILITIES (REFERENCE PROFILE)
| Task | Support Level |
|---|---|
| Code Generation | Primary |
| Instruction Following | Full |
| Long-Context Reasoning (1M tokens) | Full |
| Conversational AI | Full |
| Text Generation | Full |
| Translation | Limited (English primary) |
7. ONE-LINE THESIS
DFQS-IQ1_S-XL defines an ultra-low-bit operational deployment class for large-scale MoE inference under constrained memory environments.
8. DFQS POSITIONING LAYER
The following hierarchy defines DFQS-IQ1_S-XL within the broader inference compression spectrum:
FP16 / FP8 (Frontier Models)
→ Q4–Q6 GGUF (Production Inference)
→ IQ2 (Experimental Compression)
→ DFQS-IQ1_S-XL (Reference Implementation)
9. WHY 61.6GB MATTERS
Traditional DeepSeek-V4-Flash deployments typically operate within:
- 120GB–300GB GGUF ranges
- GPU-first inference systems
DFQS-IQ1_S-XL establishes:
- sub-70GB operational envelope
- CPU-accessible MoE inference
- constrained-memory deployment feasibility
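As a rough illustration of what the 61.6GB figure implies, the average storage per parameter can be computed from the numbers in the spec snapshot. This is a back-of-the-envelope sketch that assumes 1 GB = 10^9 bytes and ignores that the file also holds metadata and some tensors stored at higher precision; the helper names are illustrative.

```python
def effective_bits_per_weight(file_size_gb, total_params_billion):
    """Average stored bits per parameter, assuming 1 GB = 1e9 bytes."""
    total_bits = file_size_gb * 1e9 * 8
    return total_bits / (total_params_billion * 1e9)


def dense_size_gb(total_params_billion, bits_per_weight):
    """Size in decimal GB if every parameter used the given bit width."""
    return total_params_billion * 1e9 * bits_per_weight / 8 / 1e9


bpw = effective_bits_per_weight(61.6, 284)  # figures from the spec snapshot
fp16_gb = dense_size_gb(284, 16)            # same parameter count at 16 bits
print(f"{bpw:.2f} bits/weight; FP16 would need about {fp16_gb:.0f} GB")
```

Under these assumptions the artifact averages roughly 1.74 bits per parameter, against about 568 GB for the same parameter count at 16 bits.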
10. BEHAVIORAL PROFILE
DFQS-IQ1_S-XL prioritizes operational stability under compression over benchmark maximization.
| Property | Behavior |
|---|---|
| Routing Consistency | Stable |
| Deterministic Execution | Maintained |
| Long-Context Stability | Gradual degradation |
| CPU Feasibility | Supported |
| Expert Coherence | Preserved |
LIMITATIONS (BEHAVIORAL CONSTRAINTS)
- Performance degrades under long-context saturation
- Routing variance increases under extreme token pressure
- Memory constraints may trigger latency spikes or truncation behavior
Within the defined compression and memory constraints, inference stability is maintained.
11. EVALUATION INTERFACE
REQUIRED METRICS
All DFQS implementations SHALL report:
reasoning_score: float
code_score: float
context_stability_curve: list[float]
cpu_tokens_per_sec: float
failure_boundary_tokens: int
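One way to carry these five fields in code is a small dataclass that serializes to JSON. This is a sketch, not an official DFQS schema: the class name `DFQSReport`, the validation rules, and the JSON layout are assumptions, and only the field names and types come from the list above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DFQSReport:
    """Container for the five required metrics (field names from the spec)."""

    reasoning_score: float
    code_score: float
    context_stability_curve: list[float]
    cpu_tokens_per_sec: float
    failure_boundary_tokens: int

    def __post_init__(self):
        # Sanity checks chosen for this sketch; the spec does not define them.
        if not self.context_stability_curve:
            raise ValueError("context_stability_curve must not be empty")
        if self.cpu_tokens_per_sec <= 0:
            raise ValueError("cpu_tokens_per_sec must be positive")
        if self.failure_boundary_tokens < 0:
            raise ValueError("failure_boundary_tokens must be non-negative")

    def to_json(self):
        """Serialize the report as indented JSON."""
        return json.dumps(asdict(self), indent=2)
```

A report would be built as `DFQSReport(reasoning_score=..., code_score=..., ...)` with values measured by the evaluator; no measured values are given in this card, so none are shown here.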
EVALUATION CONDITIONS
- CPU-only baseline unless specified
- llama.cpp runtime
- standardized prompt sets
MEASUREMENT CONVENTION
All metrics SHALL be reported under identical prompt and runtime conditions so that results are comparable across models.
12. IMPLEMENTATION NOTES (NON-NORMATIVE)
The DFQS-IQ1_S-XL artifact uses a sequential shard merge process:
- Sequential shard ingestion
- Chunked binary concatenation
- GGUF header validation
- Post-validation cleanup
This describes implementation behavior and does not define DFQS requirements.
Efficiency Note
This approach reduces intermediate storage requirements compared to full shard reconstruction workflows.
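The four steps above can be sketched as follows. This is an illustration and not the author's merge script: it assumes the shards are raw byte-range splits of a single GGUF file, listed in order, and `merge_shards` is a hypothetical name. The only format fact relied on is that a GGUF file begins with the four ASCII bytes `GGUF`.

```python
import os

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def merge_shards(shard_paths, output_path, chunk_size=64 * 1024 * 1024, cleanup=False):
    """Concatenate ordered raw shards into one file and check the GGUF magic."""
    # Steps 1 and 2: ingest shards in order, copying each in bounded chunks
    # so that no shard has to fit in memory.
    with open(output_path, "wb") as out:
        for shard in shard_paths:
            with open(shard, "rb") as src:
                while True:
                    chunk = src.read(chunk_size)
                    if not chunk:
                        break
                    out.write(chunk)

    # Step 3: minimal header validation on the merged file.
    with open(output_path, "rb") as merged:
        if merged.read(4) != GGUF_MAGIC:
            raise ValueError(f"{output_path} does not start with the GGUF magic")

    # Step 4: remove the shards only after validation has passed.
    if cleanup:
        for shard in shard_paths:
            os.remove(shard)
    return output_path
```

Checking only the magic bytes catches a wrong shard order or a non-GGUF input, but not corruption further into the file; comparing the SHA-256 digest against the published value covers that case.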
13. DEPLOYMENT
llama.cpp
# Using the merged single file
llama-server -hf persadian/DeepSeek-V4-Flash-IQ1_S-XL
# Or download into the current directory and run locally
huggingface-cli download persadian/DeepSeek-V4-Flash-IQ1_S-XL DeepSeek-V4-Flash-IQ1_S-XL.gguf --local-dir .
./llama-cli -m DeepSeek-V4-Flash-IQ1_S-XL.gguf -p "Your prompt"
Python
from llama_cpp import Llama
llm = Llama.from_pretrained(
repo_id="persadian/DeepSeek-V4-Flash-IQ1_S-XL",
filename="DeepSeek-V4-Flash-IQ1_S-XL.gguf",
)
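Once loaded, the model can be queried through the chat completion API of llama-cpp-python. The sketch below wraps the loading call shown above in a helper; `build_messages`, `ask`, and the `n_ctx=4096` context size are illustrative choices rather than values from this card. It assumes `llama-cpp-python` and `huggingface-hub` are installed and that the installed runtime supports this model architecture.

```python
def build_messages(prompt, system=None):
    """Build an OpenAI-style message list for a single-turn chat."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return messages


def ask(prompt, n_ctx=4096):
    """Load the model and return the reply text for one prompt.

    The first call downloads the full GGUF file into the Hugging Face cache.
    """
    # Imported lazily so build_messages can be used without the package.
    from llama_cpp import Llama  # pip install llama-cpp-python huggingface-hub

    llm = Llama.from_pretrained(
        repo_id="persadian/DeepSeek-V4-Flash-IQ1_S-XL",
        filename="DeepSeek-V4-Flash-IQ1_S-XL.gguf",
        n_ctx=n_ctx,  # assumed context window for this sketch
    )
    response = llm.create_chat_completion(messages=build_messages(prompt))
    return response["choices"][0]["message"]["content"]
```

For repeated queries, the `Llama` object would normally be created once and reused, since loading a file of this size dominates the cost of a single short prompt.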
Ollama
ollama run hf.co/persadian/DeepSeek-V4-Flash-IQ1_S-XL
Docker
docker model run hf.co/persadian/DeepSeek-V4-Flash-IQ1_S-XL
14. HARDWARE ENVELOPE
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 80GB | 128GB |
| GPU VRAM (optional) | 22GB | 24GB+ |
| Storage | 60GB | 150GB |
Runtime memory includes KV cache overhead and context expansion.
15. VALIDATION STATUS
- GGUF integrity: validated at load-time
- Single-file structure: confirmed
- llama.cpp compatibility: tested
- CPU inference: operational
16. SYSTEM ADOPTION ANALYSIS
The DFQS-IQ1_S-XL reference implementation has demonstrated substantial direct deployment adoption relative to the upstream shard-distribution workflow.
This adoption pattern suggests increasing preference toward:
- single-file deployment architectures
- constrained-memory inference workflows
- deployment-ready GGUF artifacts
- deterministic reconstruction-free execution paths
The separation between shard-based distribution and DFQS deployment implementation reflects a layered inference infrastructure model:
| Layer | Function |
|---|---|
| Shard Repository | Artifact distribution and reconstruction workflows |
| DFQS-IQ1_S-XL | Reference deployment implementation |
| DFQS Specification | Deployment standardization layer |
| DFQS Evaluation Suite | Runtime validation framework |
This repository serves as the canonical DFQS reference deployment implementation for DeepSeek-V4-Flash under constrained-memory operational environments.
17. CITATION
@misc{persadian2026dfqs_iq1sxl,
author = {Persadh, Darshani},
title = {DFQS-IQ1_S-XL: Ultra-Low-Bit MoE Deployment Standard},
year = {2026},
publisher = {Hugging Face},
version = {IQ1_S-XL},
doi = {10.57967/hf/8853},
url = {https://doi.org/10.57967/hf/8853}
}
APA
Persadh, D.R. (2026). DFQS-IQ1_S-XL: Ultra-Low-Bit MoE Deployment Standard (IQ1_S-XL) [persadian/DeepSeek-V4-Flash-IQ1_S-XL.gguf]. Hugging Face. https://doi.org/10.57967/hf/8853
18. DFQS DEPLOYMENT EFFICIENCY CONTEXT
This model's compression architecture reduces inference resource requirements relative to standard MoE deployments.
A reduced compute footprint is a secondary outcome of the constrained-memory design; the carbon offset is a separate measure.
Total CO2 offset: 20 kg
Offset Project Code: 9184338
This model is released as part of sustainable AI practices.
ENVIRONMENTAL IMPACT
This model's development and hosting have been carbon-offset through reforestation initiatives.

19. FINAL STATEMENT
This repository defines a DFQS-compliant deployment boundary for constrained Mixture-of-Experts inference systems.
Base model: deepseek-ai/DeepSeek-V4-Flash