Instructions to use dangvansam/chandra-ocr-2-FP8-dynamic with libraries and local apps.

Libraries

How to use dangvansam/chandra-ocr-2-FP8-dynamic with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="dangvansam/chandra-ocr-2-FP8-dynamic")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("dangvansam/chandra-ocr-2-FP8-dynamic")
model = AutoModelForImageTextToText.from_pretrained("dangvansam/chandra-ocr-2-FP8-dynamic")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use dangvansam/chandra-ocr-2-FP8-dynamic with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "dangvansam/chandra-ocr-2-FP8-dynamic"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dangvansam/chandra-ocr-2-FP8-dynamic",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/dangvansam/chandra-ocr-2-FP8-dynamic

SGLang

How to use dangvansam/chandra-ocr-2-FP8-dynamic with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "dangvansam/chandra-ocr-2-FP8-dynamic" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dangvansam/chandra-ocr-2-FP8-dynamic",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "dangvansam/chandra-ocr-2-FP8-dynamic" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dangvansam/chandra-ocr-2-FP8-dynamic",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use dangvansam/chandra-ocr-2-FP8-dynamic with Docker Model Runner:
```
docker model run hf.co/dangvansam/chandra-ocr-2-FP8-dynamic
```

chandra-ocr-2 — FP8 Dynamic

FP8 dynamic-activation quantization of datalab-to/chandra-ocr-2 produced with llm-compressor and packed as compressed-tensors for native vLLM inference.

This is the quant that works on almost all modern GPUs: FP8 runs natively on Ada (RTX 4090 / L40S), Hopper (H100), and Blackwell. Measured throughput is about 5434 ms/page with sequential requests (0.18 pages/s), 2.3× faster than bf16. Pick this when you don't have a Blackwell GPU, or when the runner issues one batched request per document.

For the original model description, intended uses, accuracy benchmarks (olmOCR-bench, 90-language) and license terms, see the upstream card: https://huggingface.co/datalab-to/chandra-ocr-2.

Quantization recipe

# recipe.yaml (shipped in this repo)
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore:
        - 're:.*lm_head'
        - 're:visual.*'             # keep ViT vision tower bf16
        - 're:model.visual.*'
        - 're:.*mlp.gate$'
        - 're:.*embed_tokens$'
        - 're:.*shared_expert_gate$'
        - 're:.*mlp\.shared_expert$'
        - 're:.*linear_attn.*'
      scheme: FP8_DYNAMIC
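
The `ignore` entries above are regular expressions (the `re:` prefix). As a rough illustration of which layers they exclude, the sketch below applies them with `re.match`, which is an assumption about how the patterns are evaluated. The module names are hypothetical examples, not taken from the actual checkpoint.

```python
import re

# Ignore patterns copied from recipe.yaml.
IGNORE = [
    r"re:.*lm_head",
    r"re:visual.*",
    r"re:model.visual.*",
    r"re:.*mlp.gate$",
    r"re:.*embed_tokens$",
    r"re:.*shared_expert_gate$",
    r"re:.*mlp\.shared_expert$",
    r"re:.*linear_attn.*",
]


def is_ignored(module_name, patterns=IGNORE):
    """Return True if any 're:' pattern matches from the start of the name.

    Only the 're:' form used in this recipe is handled here.
    """
    return any(
        p.startswith("re:") and re.match(p[3:], module_name) is not None
        for p in patterns
    )


# Hypothetical module names, for illustration only.
names = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.gate_proj",
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.1.linear_attn.in_proj",
]

for name in names:
    print(f"{name}: {'kept in bf16' if is_ignored(name) else 'quantized to FP8'}")
```

Under these assumptions, the `$` anchor on `mlp.gate` means a layer named `...mlp.gate_proj` is still quantized, while the router-style `...mlp.gate` is left in bf16.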

Weights: FP8 E4M3 (per-channel static scales)
Activations: FP8 dynamic (per-token scales computed at runtime, no calibration needed)
Vision tower, lm_head, MoE gates and linear_attn.* kept in bf16.

Because activations are dynamic, this quant requires no calibration dataset — accuracy ≈ upstream bf16 within OCR task noise.
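
To make the difference between static per-channel weight scales and dynamic per-token activation scales concrete, here is a minimal pure-Python sketch. It simulates E4M3 rounding in software and is not the llm-compressor or vLLM implementation. All function names are illustrative, and the matrices are random toy data.

```python
import math
import random

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def round_to_e4m3(v):
    """Simulate rounding a float to the nearest E4M3 value (3 mantissa bits)."""
    v = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v))
    if v == 0.0:
        return 0.0
    # Exponent of the value, floored at -6 (the smallest normal exponent).
    exp = max(math.floor(math.log2(abs(v))), -6)
    step = 2.0 ** (exp - 3)  # 8 representable steps per power of two
    return round(v / step) * step


def quantize_rows(matrix):
    """Scale each row so its largest magnitude maps to 448, then round."""
    quantized, scales = [], []
    for row in matrix:
        scale = max(abs(v) for v in row) / FP8_E4M3_MAX or 1.0
        quantized.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales


random.seed(0)
weights = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]      # (out, in)
activations = [[random.gauss(0, 1) for _ in range(16)] for _ in range(4)]  # (tokens, in)

# Weights: one scale per output channel, computed once offline (static).
w_q, w_scales = quantize_rows(weights)
# Activations: one scale per token, recomputed on every forward pass (dynamic).
x_q, x_scales = quantize_rows(activations)

# Matmul on the quantized values, then undo both scales.
y_q = [
    [sum(a * b for a, b in zip(xr, wr)) * xs * ws for wr, ws in zip(w_q, w_scales)]
    for xr, xs in zip(x_q, x_scales)
]
y_ref = [[sum(a * b for a, b in zip(xr, wr)) for wr in weights] for xr in activations]

max_ref = max(abs(v) for row in y_ref for v in row)
max_err = max(abs(q - r) for qr, rr in zip(y_q, y_ref) for q, r in zip(qr, rr))
print(f"max error relative to largest output: {max_err / max_ref:.4f}")
```

Because the activation scales come from the tensor being processed, nothing has to be estimated ahead of time, which is why no calibration set is involved.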

Hardware requirements

GPU family	Compute capability	FP8 tensor cores	Recommended?
Blackwell (RTX PRO 6000, B100/B200, RTX 5090)	sm_100 / sm_120	✅ Native	✅
Hopper (H100/H200)	sm_90	✅ Native	✅
Ada (RTX 4090, L40S)	sm_89	✅ Native	✅
Ampere (A100/3090)	sm_80/86	Software fallback (bf16 compute)	⚠️ no speedup
Turing & older	≤ sm_75	❌	❌

Requires vLLM ≥ 0.17 (works with the current vLLM OpenAI-compatible Docker image). On Ada and Hopper this is the only Chandra-2 quant that actually accelerates inference, because those GPUs have no FP4 tensor cores for the NVFP4 variants to use.
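
The hardware table can be turned into a quick pre-flight check. This is a sketch only: `fp8_support` is an illustrative helper, and the thresholds simply restate the table (sm_89 and newer native, sm_80 to sm_87 fallback, older unsupported).

```python
def fp8_support(major, minor):
    """Classify a CUDA compute capability according to the hardware table."""
    cc = major * 10 + minor
    if cc >= 89:
        return "native"       # Ada, Hopper, Blackwell: FP8 tensor cores
    if cc >= 80:
        return "fallback"     # Ampere: runs, but no FP8 speedup
    return "unsupported"      # Turing and older


for name, capability in [("RTX 4090", (8, 9)), ("H100", (9, 0)),
                         ("A100", (8, 0)), ("T4", (7, 5))]:
    print(f"{name}: {fp8_support(*capability)}")

# With PyTorch installed, the local GPU could be checked like this:
#   import torch
#   print(fp8_support(*torch.cuda.get_device_capability()))
```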

Benchmark (vs. other Chandra-2 quants)

Test bed: RTX PRO 6000 Blackwell Max-Q (96 GB), 14-page Vietnamese financial-statement PDF, vLLM 0.19.1, max-num-seqs=128, max-num-batched-tokens=32768, kv-cache=fp8.

Build	Sequential per-doc (ms/page)	Concurrent per-page (ms/page)	Best ms/page	vs bf16
bf16 baseline	12724 ms	12642 ms	12642	1.0×
FP8_DYNAMIC	5434 ms	9525 ms	5434	2.3×
NVFP4A16	12280 ms	5058 ms	5058	2.5×
NVFP4 (W4A4)	10092 ms	5794 ms	5794	2.2×

Take-away: FP8_DYNAMIC is fastest under sequential per-document batching (one big request, KV cache fully utilised). For page-level concurrent fan-out on Blackwell, switch to NVFP4A16.
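
The derived columns in the table can be reproduced from the two measured timings. The short sketch below assumes both timing columns are in ms/page (as the "Best ms/page" column implies) and takes the bf16 best time as the baseline. The `summarize` helper is illustrative.

```python
# (sequential per-doc, concurrent per-page) timings copied from the table.
results = {
    "bf16 baseline": (12724, 12642),
    "FP8_DYNAMIC": (5434, 9525),
    "NVFP4A16": (12280, 5058),
    "NVFP4 (W4A4)": (10092, 5794),
}


def summarize(results, baseline="bf16 baseline"):
    """Return {build: (best ms/page, pages per second, speedup vs baseline)}."""
    baseline_best = min(results[baseline])
    summary = {}
    for build, timings in results.items():
        best = min(timings)
        summary[build] = (best, round(1000 / best, 2), round(baseline_best / best, 1))
    return summary


for build, (best, pages_per_s, speedup) in summarize(results).items():
    print(f"{build:<14} {best:>6} ms/page  {pages_per_s:.2f} pages/s  {speedup:.1f}x")
```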

Usage

vLLM (OpenAI-compatible server) — recommended

vllm serve dangvansam/chandra-ocr-2-FP8-dynamic \
  --served-model-name chandra \
  --max-model-len 16384 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 16384 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --trust-remote-code \
  --mm-processor-kwargs '{"min_pixels": 3136, "max_pixels": 6291456}'

from openai import OpenAI
import base64, pathlib

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
img_b64 = base64.b64encode(pathlib.Path("page.png").read_bytes()).decode()

resp = client.chat.completions.create(
    model="chandra",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            {"type": "text", "text": "<ocr_layout>"},
        ],
    }],
    max_tokens=12000,
    temperature=0.0,
)
print(resp.choices[0].message.content)
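
This card does not spell out how the benchmark's "concurrent per-page" mode was driven. One plausible client-side reading is to send each page as its own request in parallel. The sketch below shows that pattern with a thread pool. `ocr_document` and `fake_ocr_page` are hypothetical names, and in practice `ocr_page` would wrap the `client.chat.completions.create` call shown above.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_document(pages, ocr_page, max_workers=8):
    """Run ocr_page over every page concurrently; results keep page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_page, pages))


def fake_ocr_page(page_bytes):
    """Stand-in for a real request function, so the sketch runs offline."""
    return f"<{len(page_bytes)} bytes>"


pages = [b"a", b"bb", b"ccc"]
print(ocr_document(pages, fake_ocr_page))
```

Setting `max_workers=1` degrades this to one page at a time, which makes it easy to compare the two submission patterns against the same server.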

HuggingFace Transformers

The vision tower stays in bf16, so the upstream snippet works unchanged — just swap the repo id to dangvansam/chandra-ocr-2-FP8-dynamic. See the upstream card.

When to pick which Chandra-2 quant

Workload	Pick
Ada (RTX 4090, L40S) or Hopper (H100) GPU	FP8_DYNAMIC (this repo)
Single sequential request per doc on any modern GPU	FP8_DYNAMIC (this repo)
Page-concurrent fan-out on Blackwell	NVFP4A16
Max compression, accuracy not critical	NVFP4 (W4A4)
Reference accuracy / older hardware	upstream bf16

Files

model.safetensors — FP8-packed weights (~13 GB)
config.json, processor_config.json, preprocessor_config.json, tokenizer.json, tokenizer_config.json, chat_template.jinja, generation_config.json — copied from upstream
recipe.yaml — exact llm-compressor recipe used

License & attribution

Inherits the upstream OpenRAIL-M license from datalab-to/chandra-ocr-2. Free for research, personal use, and startups under $2M in funding or revenue; not for use competing with Datalab's hosted API. For broader commercial use see Datalab pricing.

This is an unofficial community quant. No additional weights or data were added — only a numerical re-encoding of the upstream model. All credit for the model itself goes to Datalab.

Citation

@misc{chandra_ocr_2,
  author = {Datalab},
  title  = {Chandra OCR 2},
  year   = {2026},
  url    = {https://huggingface.co/datalab-to/chandra-ocr-2}
}

Downloads last month: 678

Safetensors
Model size: 5B params
Tensor types: BF16, F8_E4M3

Model tree for dangvansam/chandra-ocr-2-FP8-dynamic
Base model: datalab-to/chandra-ocr-2
Quantized (10): this model