Q4_0 Like Qwen3.5-35B-A3B

#3
by engrtipusultan - opened

Would it be possible for you to add a Q4_0 quant like you did for Qwen3.5-35B-A3B, which is fast on Vulkan and mainline llama.cpp? I think it would be useful for many people.

https://huggingface.co/ubergarm/Qwen3.5-35B-A3B-GGUF#q4_0-19776-gib-4901-bpw

Also, is there any reason you did not bump token_embd.weight to Q8?

@engrtipusultan

I'll take a look, though it will be larger: Qwen3-Coder-Next is ~80B, so my custom Q4_0 Vulkan mix will likely be about 80*(5/8) = 50 GB or so. That requires quite a bit of VRAM, but it would be fine on a Strix Halo unified-memory system, if that works for you?
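As a quick sanity check on that estimate: a 4-bit mix averages roughly 5 bits per weight once quantization overhead is included, so the size is just parameter count times average bits-per-weight divided by 8. The 80B and 5 bpw figures are the rough numbers from the message above:

```python
# Back-of-the-envelope GGUF size estimate:
# total parameters * average bits-per-weight / 8 bits-per-byte.
params = 80e9      # ~80B parameters (rough figure from above)
avg_bpw = 5.0      # approximate average for a Q4_0-style mix
size_gb = params * avg_bpw / 8 / 1e9  # decimal GB
print(size_gb)  # 50.0, matching the 80*(5/8) estimate
```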

> Also, is there any reason you did not bump token_embd.weight to Q8?

Yes, the old tradition on mainline is token_embd @ q4_K and the final "head" output @ q6_K... For ik_llama.cpp I use the newer variants, e.g. iq4_k and iq6_k, which I've measured as slightly better.

So going with q4_1 at 5 bpw is still comparable to, or slightly better than, q4_K. It saves a chunk of VRAM, which can then be used for longer kv-cache context, given it is fairly tight in 24 GB of VRAM already. It's all trade-offs, and I wouldn't expect much PPL/KLD benefit from bumping it to the larger q8_0, which would also slow down TG slightly.
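To put a rough number on that VRAM saving: q4_1 stores about 5 bpw while q8_0 stores 8.5 bpw, so the embedding table shrinks by 3.5 bits per weight. The vocab and hidden sizes below are hypothetical round numbers for illustration, not the actual Qwen3-Coder-Next dimensions:

```python
# VRAM saved by keeping token_embd at q4_1 (~5 bpw) instead of q8_0 (8.5 bpw).
# vocab_size and hidden_size are hypothetical round numbers, NOT the
# real model dimensions.
vocab_size, hidden_size = 150_000, 2048
params = vocab_size * hidden_size          # weights in the embedding table

def size_mb(bpw):
    return params * bpw / 8 / 1e6          # decimal MB

saved = size_mb(8.5) - size_mb(5.0)        # MB freed for kv-cache
print(round(saved, 1))
```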

Yeah, I'll cook it up and upload it, then update the perplexity graphs: `llama_model_quantize_internal: quant size = 45419.73 MB`

@engrtipusultan

Keep me posted how it goes with your rig! enjoy: https://huggingface.co/ubergarm/Qwen3-Coder-Next-GGUF?show_file_info=Qwen3-Coder-Next-Q4_0.gguf

It has quite good perplexity too!

I have 56 GB of unified memory available for inference, so I think that will be fine on all 64 GB unified systems.
Have you done any testing of iq4_nl vs q4_1 for perplexity and KLD? At least on my system iq4_nl is faster.

I wish there were no dispute between ik and greg; it would have been better for the community overall.

@engrtipusultan

> Keep me posted how it goes with your rig! enjoy: https://huggingface.co/ubergarm/Qwen3-Coder-Next-GGUF?show_file_info=Qwen3-Coder-Next-Q4_0.gguf
>
> it has quite good perplexity too!

Wow, that was fast. Awesome, thank you!

> Have you done any testing of iq4_nl vs q4_1 for perplexity and KLD? At least on my system iq4_nl is faster.

Hrmm, I haven't done a 1-to-1 comparison swapping out only q4_1 for iq4_nl... I'm surprised iq4_nl is faster for you on both prompt processing and token generation. iq4_nl may be slightly faster on TG given it is a little smaller, ~4.5 bpw vs q4_1's ~5 bpw, I think.
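The ~4.5 vs ~5 bpw gap follows from the block layouts as I understand them from ggml's quant structs (worth verifying against the current source): per 32-weight block, q4_1 stores 16 bytes of 4-bit nibbles plus an fp16 scale and an fp16 min, while iq4_nl stores the same nibbles plus a single fp16 scale, mapping values through a non-linear 16-entry lookup table instead of a linear min/scale:

```python
QK = 32  # weights per quantization block in both formats

def bpw(block_bytes, weights_per_block=QK):
    """Bits per weight implied by a block's byte size."""
    return block_bytes * 8 / weights_per_block

# q4_1: 32 nibbles (16 B) + fp16 scale (2 B) + fp16 min (2 B) = 20 B
q4_1 = bpw(16 + 2 + 2)
# iq4_nl: 32 nibbles (16 B) + fp16 scale (2 B) = 18 B
iq4_nl = bpw(16 + 2)
print(q4_1, iq4_nl)  # 5.0 4.5
```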

I'd love to see some numbers on that and will keep it in the back of my head, as iq4_nl may very well be better than q4_1...

Agreed, I wish we could run them all on both forks...

Just wanted to ask: does your script and latest upload take the following PR into account?

https://github.com/ggml-org/llama.cpp/pull/19139

Owner • edited Feb 26

@engrtipusultan

No, that was just merged 7 hours ago, and ik_llama.cpp has a number of fusion optimizations built in already, so I'm unsure I would ever use it.

Do you have opinions about it?

EDIT I'm asking them about it here: https://github.com/ggml-org/llama.cpp/pull/19139#issuecomment-3968981015

I do not have an opinion; I was simply asking for yours, since you are taking both llama.cpp forks into account.
