What is the difference between fp8 scaled and mxfp8?

#4
by shivshankar - opened

Does the 4090 support both?

MXFP8 is Blackwell only.

Comfy Org org

MXFP8 is block-wise scaled, with hardware-level support in the latest NVIDIA GPUs (Blackwell), and it gives better quality.

On a 4090 the weights have to be dequantized on the fly, which makes it slower to run, but you still get the quality benefits.

On a 50xx GPU it would be around 30% faster than bf16.

How much quality difference is there? Is the quality identical?

@shivshankar MXFP8 uses a block size of 32 (meaning every 32 quantized weights share a single scale factor), while scaled fp8 generally has either a single scale for the entire tensor (so 1 scale for about 9.4 million weights in a 3072x3072 tensor) or a single scale for each tensor row (so 1 scale per 3072 weights). Having more granular scales definitely increases quality.
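To make the effect of scale granularity concrete, here is a minimal pure-Python sketch. It is not how any particular tool quantizes: `round_e4m3` and `fake_quant` are hypothetical helpers, the element format is assumed to be FP8 E4M3 (max 448), and the scale is kept as an ordinary float, whereas real MXFP8 stores a power-of-two scale per block. The toy data is 64 weights where the first 32 are tiny and the second 32 contain large values.

```python
import math

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3


def round_e4m3(x):
    """Round x to the nearest value on the E4M3 grid, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    _, ex = math.frexp(a)        # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, -6)          # exponents below -6 fall into the subnormal range
    step = 2.0 ** (e - 3)        # 3 mantissa bits -> 8 steps per power of two
    q = min(round(a / step) * step, E4M3_MAX)
    return math.copysign(q, x)


def fake_quant(weights, group_size):
    """Quantize and dequantize, sharing one scale per group of `group_size` weights."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        amax = max(abs(w) for w in group)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        out.extend(round_e4m3(w / scale) * scale for w in group)
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Toy "tensor": one block of tiny weights, one block of large weights.
small = [1e-4 * (i + 1) / 32 for i in range(32)]
large = [10.0 * (i + 1) / 32 for i in range(32)]
weights = small + large

tensor_err = mean_abs_error(weights, fake_quant(weights, len(weights)))  # 1 scale total
block_err = mean_abs_error(weights, fake_quant(weights, 32))             # 1 scale per 32

print(f"tensor-wise error: {tensor_err:.3e}")
print(f"block-wise error:  {block_err:.3e}")
```

With a single scale, the large values set the scale and the tiny weights get squeezed into the bottom of the FP8 range, so they lose most of their precision. With one scale per 32 weights, the tiny block gets its own scale and is represented much more accurately. Passing the row length as `group_size` would give the row-wise case.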

However, in my personal opinion, the quality increase of mxfp8 is probably not worth the size increase over scaled fp8.
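For a rough sense of the size increase in question, here is a back-of-the-envelope sketch. It assumes each MXFP8 block of 32 weights carries one 8-bit scale (as in the MX format definition) and that the tensor-wise scale is a single 32-bit float; actual checkpoint files may add other metadata, so treat the numbers as an estimate only. `bits_per_weight` is a hypothetical helper.

```python
def bits_per_weight(element_bits, scale_bits, weights_per_scale):
    """Average storage per weight, including its share of the scale factor."""
    return element_bits + scale_bits / weights_per_scale


# MXFP8: 8-bit elements, one 8-bit scale per block of 32 weights.
mxfp8_bits = bits_per_weight(8, 8, 32)

# Tensor-wise scaled fp8 on a 3072x3072 tensor: one 32-bit scale (assumed) for all weights.
tensor_bits = bits_per_weight(8, 32, 3072 * 3072)

overhead = mxfp8_bits / tensor_bits - 1
print(f"mxfp8: {mxfp8_bits} bits/weight, overhead vs tensor-wise: {overhead:.2%}")
```

Under these assumptions the scales add roughly 3% to the weight storage compared with tensor-wise scaled fp8.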

@shivshankar , @mingyi456

I compared the image quality and generation speed of mxfp8 and fp8_scaled on an RTX 4060 Ti.

mxfp8 may not be the best choice for the RTX 4090.

@easygoing0114 Thanks for the work. The fp8 scaled quant in this repo uses a single tensor-wise scale. Do you know how to create a row-wise scaled fp8 quant? That should be closer to mxfp8 in quality.

@mingyi456 Thanks for the explanation. I used the convert_to_quant library to create both fp8_scaled and mxfp8. The fp8_scaled variant is probably tensor-wise, and honestly, I'm not sure how to achieve row-wise scaling with it.
