chandra-ocr-2 — FP8 Dynamic

FP8 dynamic-activation quantization of datalab-to/chandra-ocr-2 produced with llm-compressor and packed as compressed-tensors for native vLLM inference.

This is the quant that runs on nearly all modern NVIDIA GPUs: FP8 is native on Ada (RTX 4090 / L40S), Hopper (H100), and Blackwell. On the benchmark below it reaches ~5434 ms/page with sequential per-document requests (0.18 pages/s), 2.3× faster than bf16. Pick it when you don't have a Blackwell GPU, or when the runner issues one batched request per document.

For the original model description, intended uses, accuracy benchmarks (olmOCR-bench, 90-language) and license terms, see the upstream card: https://huggingface.co/datalab-to/chandra-ocr-2.


Quantization recipe

# recipe.yaml (shipped in this repo)
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore:
        - 're:.*lm_head'
        - 're:visual.*'             # keep ViT vision tower bf16
        - 're:model.visual.*'
        - 're:.*mlp.gate$'
        - 're:.*embed_tokens$'
        - 're:.*shared_expert_gate$'
        - 're:.*mlp\.shared_expert$'
        - 're:.*linear_attn.*'
      scheme: FP8_DYNAMIC

What the recipe does:

  • Weights: FP8 E4M3 (per-channel static scales)
  • Activations: FP8 dynamic (per-token scales computed at runtime, no calibration needed)
  • Vision tower, lm_head, MoE gates and linear_attn.* kept in bf16.

Because activations are dynamic, this quant requires no calibration dataset — accuracy ≈ upstream bf16 within OCR task noise.
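
To make "per-channel static" and "per-token dynamic" scaling concrete, here is a minimal pure-Python sketch of the arithmetic. It is an illustration only, not the llm-compressor or vLLM implementation: the helper names (`round_to_e4m3`, `quantize_rows`, `fp8_linear`) are hypothetical, and it assumes the common E4M3 variant whose largest finite value is 448.

```python
import math

E4M3_MAX = 448.0  # largest finite value assumed for FP8 E4M3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on the E4M3 grid, saturating at +-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**ex with m in [0.5, 1), so floor(log2(a)) == ex - 1.
    _, ex = math.frexp(a)
    e = max(ex - 1, -6)         # -6 is the smallest normal exponent; below it the grid is flat
    step = 2.0 ** (e - 3)       # 3 mantissa bits: 8 steps per power of two
    q = round(a / step) * step  # Python's round() rounds halves to even
    return math.copysign(q, x)


def quantize_rows(matrix):
    """Quantize each row with its own scale so the row maximum maps to 448."""
    scales = [(max(abs(v) for v in row) / E4M3_MAX) or 1.0 for row in matrix]
    q = [[round_to_e4m3(v / s) for v in row] for row, s in zip(matrix, scales)]
    return q, scales


def fp8_linear(x, w):
    """Compute x @ w.T from quantized operands, then rescale the result."""
    xq, xs = quantize_rows(x)  # per token: derived from the live activations
    wq, ws = quantize_rows(w)  # per output channel: would be computed once, offline
    return [[sum(a * b for a, b in zip(xr, wr)) * sx * sw
             for wr, sw in zip(wq, ws)]
            for xr, sx in zip(xq, xs)]


x = [[0.5, -1.25, 3.0], [100.0, 0.01, -7.0]]   # two "tokens"
w = [[0.02, -0.01, 0.03], [1.0, 2.0, -0.5]]    # two output channels
y = fp8_linear(x, w)
exact = [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]
print(y)
print(exact)
```

Because the activation scales come from the tensor being multiplied, nothing about the input distribution has to be known ahead of time, which is why no calibration set is needed.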

Hardware requirements

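GPU family                                    | Compute capability | FP8 tensor cores                 | Recommended?
----------------------------------------------|--------------------|----------------------------------|---------------
Blackwell (RTX PRO 6000, B100/B200, RTX 5090) | sm_100+            | ✅ Native                        |
Hopper (H100/H200)                            | sm_90              | ✅ Native                        |
Ada (RTX 4090, L40S)                          | sm_89              | ✅ Native                        |
Ampere (A100/3090)                            | sm_80/86           | Software fallback (bf16 compute) | ⚠️ no speedup
Turing & older                                | ≤ sm_75            |                                  |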

Requires vLLM ≥ 0.17 (works with the current vLLM OpenAI-compatible server image). On Ada and Hopper this is the only Chandra-2 quant that actually accelerates inference: those GPUs have no FP4 tensor cores, so the NVFP4 variants gain no hardware speedup there.
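
A quick way to apply the hardware table is to compare the GPU's compute capability against sm_89. The sketch below encodes only what the table states; `fp8_support` is a hypothetical helper, and the "unlisted" label reflects that the table gives no entry for Turing and older.

```python
def fp8_support(capability):
    """Classify a CUDA compute capability (major, minor) using the table above."""
    if capability >= (8, 9):
        return "native"             # Ada (sm_89), Hopper (sm_90), Blackwell
    if capability >= (8, 0):
        return "software-fallback"  # Ampere: runs with bf16 compute, no speedup
    return "unlisted"               # Turing and older: no entry in the table


for name, cc in [("RTX 4090", (8, 9)), ("H100", (9, 0)), ("A100", (8, 0)), ("T4", (7, 5))]:
    print(f"{name}: {fp8_support(cc)}")
```

If PyTorch is installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple for the current device and can be passed straight in.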

Benchmark (vs. other Chandra-2 quants)

Test bed: RTX PRO 6000 Blackwell Max-Q (96 GB), 14-page Vietnamese financial-statement PDF, vLLM 0.19.1, max-num-seqs=128, max-num-batched-tokens=32768, kv-cache=fp8.

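Build         | Sequential per-doc | Concurrent per-page | Best ms/page | vs bf16
--------------|--------------------|---------------------|--------------|--------
bf16 baseline | 12724 ms           | 12642 ms            | 12642        | 1.0×
FP8_DYNAMIC   | 5434 ms            | 9525 ms             | 5434         | 2.3×
NVFP4A16      | 12280 ms           | 5058 ms             | 5058         | 2.5×
NVFP4 (W4A4)  | 10092 ms           | 5794 ms             | 5794         | 2.2×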

Take-away: FP8_DYNAMIC is fastest under sequential per-document batching (one big request, KV cache fully utilised). For page-level concurrent fan-out on Blackwell, switch to NVFP4A16.
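
The derived columns follow directly from the measured timings: "Best ms/page" is the lower of the two modes, the speedup is the bf16 best divided by the build's best, and pages/s is 1000 divided by ms/page. A short sketch that recomputes them from the table values (the function and variable names are illustrative):

```python
# (sequential per-doc, concurrent per-page) in ms/page, copied from the table above
results_ms = {
    "bf16 baseline": (12724, 12642),
    "FP8_DYNAMIC": (5434, 9525),
    "NVFP4A16": (12280, 5058),
    "NVFP4 (W4A4)": (10092, 5794),
}


def summarize(results, baseline="bf16 baseline"):
    """Derive best ms/page, pages/s and speedup vs the baseline's best mode."""
    baseline_best = min(results[baseline])
    summary = {}
    for build, (sequential, concurrent) in results.items():
        best = min(sequential, concurrent)
        summary[build] = {
            "best_ms": best,
            "pages_per_s": 1000.0 / best,
            "speedup": baseline_best / best,
        }
    return summary


summary = summarize(results_ms)
for build, row in summary.items():
    print(f"{build:14s} {row['best_ms']:6d} ms/page  "
          f"{row['pages_per_s']:.2f} pages/s  {row['speedup']:.1f}x")
```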

Usage

vLLM (OpenAI-compatible server) — recommended

vllm serve dangvansam/chandra-ocr-2-FP8-dynamic \
  --served-model-name chandra \
  --max-model-len 16384 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 16384 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --trust-remote-code \
  --mm-processor-kwargs '{"min_pixels": 3136, "max_pixels": 6291456}'

Then query the server with the OpenAI Python client:

from openai import OpenAI
import base64, pathlib

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
img_b64 = base64.b64encode(pathlib.Path("page.png").read_bytes()).decode()

resp = client.chat.completions.create(
    model="chandra",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            {"type": "text", "text": "<ocr_layout>"},
        ],
    }],
    max_tokens=12000,
    temperature=0.0,
)
print(resp.choices[0].message.content)
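
For multi-page documents, the "concurrent per-page" mode from the benchmark can be approximated by sending one request per page in parallel. The following is a minimal sketch under stated assumptions, not the benchmark harness: `build_page_request` and `ocr_pages` are hypothetical helpers, `send` is any callable that submits one request and returns its text, and the worker count of 8 is arbitrary.

```python
import base64
from concurrent.futures import ThreadPoolExecutor


def build_page_request(png_bytes, model="chandra", prompt="<ocr_layout>"):
    """Build chat.completions kwargs for one page, mirroring the request above."""
    img_b64 = base64.b64encode(png_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 12000,
        "temperature": 0.0,
    }


def ocr_pages(pages, send, max_workers=8):
    """Send one request per page concurrently; results come back in page order."""
    requests = [build_page_request(p) for p in pages]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, requests))
```

With the `client` from the snippet above, `send` could be `lambda kw: client.chat.completions.create(**kw).choices[0].message.content`, and `pages` a list of PNG byte strings, one per page.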

HuggingFace Transformers

The vision tower stays in bf16, so the upstream snippet works unchanged — just swap the repo id to dangvansam/chandra-ocr-2-FP8-dynamic. See the upstream card.

When to pick which Chandra-2 quant

Workload                                            | Pick
----------------------------------------------------|------------------------
Ada (RTX 4090, L40S) or Hopper (H100) GPU           | FP8_DYNAMIC (this repo)
Single sequential request per doc on any modern GPU | FP8_DYNAMIC (this repo)
Page-concurrent fan-out on Blackwell                | NVFP4A16
Max compression, accuracy not critical              | NVFP4 (W4A4)
Reference accuracy / older hardware                 | upstream bf16

Files

  • model.safetensors — FP8-packed weights (~13 GB)
  • config.json, processor_config.json, preprocessor_config.json, tokenizer.json, tokenizer_config.json, chat_template.jinja, generation_config.json — copied from upstream
  • recipe.yaml — exact llm-compressor recipe used

License & attribution

Inherits the upstream OpenRAIL-M license from datalab-to/chandra-ocr-2. Free for research, personal use, and startups <$2M; not for use competing with Datalab's hosted API. For broader commercial use see Datalab pricing.

This is an unofficial community quant. No additional weights or data were added — only a numerical re-encoding of the upstream model. All credit for the model itself goes to Datalab.

Citation

@misc{chandra_ocr_2,
  author = {Datalab},
  title  = {Chandra OCR 2},
  year   = {2026},
  url    = {https://huggingface.co/datalab-to/chandra-ocr-2}
}