Infinity-Parser2-Flash — AWQ W4A16 (int4)
An int4 (W4A16) AWQ quantization of infly/Infinity-Parser2-Flash, made to run on NVIDIA A100 / sm80, where the original FP8 path is unsupported. 4.2 GB bf16 → **2.9 GB**.
Method
AWQ via llm-compressor, routed experts only (W4A16). Attention, GDN/linear-attention, shared expert, vision tower, and lm_head are kept in bf16. Calibrated on a few hundred diverse document + general-vision samples.
Serving note (important)
vLLM fuses some bf16 layers before consulting the ignore list, which can otherwise yield all-! output. Fix: the saved config.json quantization_config.ignore uses broad regexes matching the fused names. Already applied here.
Quality (VLMEvalKit, AI-judged reproduction)
| Benchmark | Published bf16 | This int4 |
|---|---|---|
| MMStar | 57.1 | 54.8 |
| OCRBench | 81.6 | 85.0 |
| DocVQA (val) | 93.2 | 93.5 |
Near-lossless on the cleanly-comparable axes. (MMBench omitted to avoid a circular-vs-vanilla scoring mismatch.)
Usage (vLLM)
vllm serve spectator2026/Infinity-Parser2-Flash-AWQ-W4A16 --dtype bfloat16 --trust-remote-code --reasoning-parser qwen3
Pass chat_template_kwargs={"enable_thinking": false} in requests, or answers land in the reasoning channel.
Quantized by @spectator2026. Original model © infly, Apache-2.0.
- Downloads last month
- 13
Model tree for spectator2026/Infinity-Parser2-Flash-AWQ-W4A16
Base model
infly/Infinity-Parser2-Flash