This is far from an ideal quantization: I simply requantized am17an/Qwen3.6-35BA3B-MTP-GGUF (originally Q8_0) down to Q4_0.
Made using this llama.cpp PR.
Flags I use for inference (25 tk/s to 37 tk/s, roughly a 1.5x speedup):
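For reference, this kind of requantization can be done with llama.cpp's `llama-quantize` tool; a minimal sketch, assuming the Q8_0 source file sits next to the output (file paths are illustrative, not the author's exact ones):

```shell
# Requantize the Q8_0 GGUF down to Q4_0 (paths are illustrative)
./llama.cpp/build/bin/llama-quantize \
  ./Models/Qwen3.6-35BA3B-MTP-Q8_0.gguf \
  ./Models/Qwen3.6-35BA3B-MTP-Q4_0.gguf \
  Q4_0
```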
```shell
./llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 5000 \
-m ./Models/Qwen3.6-35BA3B-MTP-Q4_0.gguf --fit on \
-a "Qwen3.6-35B-A3B" -c 200000 \
--top-k 20 --top-p 0.95 --min-p 0 --repeat-penalty 1.0 --presence-penalty 0.0 \
-fa on --temp 0.6 -ctk q8_0 -ctv q8_0 --batch-size 4096 \
--reasoning off --no-mmap --jinja -t 6 --checkpoint-every-n-tokens 4096 --ctx-checkpoints 64 -fitt 512 \
--spec-draft-n-max 2
```
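Once up, llama-server exposes an OpenAI-compatible API on the configured host and port. A quick smoke test, assuming the server above is running locally on port 5000 (the prompt is just an example):

```shell
# Query llama-server's OpenAI-compatible chat endpoint
# (assumes the server above is running on localhost:5000)
curl -s http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.6
  }'
```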
Base model: am17an/Qwen3.6-35BA3B-MTP-GGUF