PaddleOCR-VL-1.5 — NVFP4 (modelopt checkpoint)

NVFP4 quantization of PaddleOCR-VL-1.5 (a multimodal vision-language model for OCR), produced with nvidia-modelopt. The checkpoint itself is hardware-agnostic, but NVFP4-accelerated inference requires an NVIDIA Blackwell GPU (RTX 50xx, B100/B200, GB200) and TensorRT-LLM.

What's in this repo

config.json                # PaddleOCR-VL config + NVFP4 quantization metadata
generation_config.json
preprocessor_config.json   # vision preprocessing (resize, normalize)
processor_config.json      # multimodal processor binding
tokenizer.json + tokenizer_config.json + chat_template.jinja
model.safetensors          # NVFP4 weights (vision + decoder + cross-attn) (~600 MB - 1 GB)

This is the modelopt checkpoint, not a prebuilt TRT-LLM engine. You build the engine yourself on the GPU that will run it.

Build the engine on your Blackwell GPU

# Download
git lfs install
git clone https://huggingface.co/tss-deposium/PaddleOCR-VL-1.5-nvfp4
cd PaddleOCR-VL-1.5-nvfp4

# Validate before the full build (~30s) — cheap signal for compatibility
trtllm-build --checkpoint_dir . --output_dir /tmp/dryrun \
    --dry_run --log_level debug

# Full engine build (5-15 min on RTX 50xx — smaller than Gemma 4)
# Note: VLM engines may need separate vision and decoder builds depending on the
# TRT-LLM version. Run `trtllm-build --help` and check for the `--workers` / `--multimodal` flags.
trtllm-build --checkpoint_dir . \
    --output_dir ./engine \
    --gemm_plugin nvfp4 \
    --max_batch_size 4 --max_input_len 4096 --max_seq_len 5120 \
    --use_paged_context_fmha enable
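
The dry-run-then-build sequence above can also be scripted so that the full build only starts when the dry run succeeds. The following is a minimal sketch, not part of this repo: `build_commands` and `build_engine` are hypothetical helper names, and the flags are copied verbatim from the commands above, so they carry the same TRT-LLM version caveats.

```python
import shlex
import subprocess


def build_commands(checkpoint_dir=".", engine_dir="./engine", dryrun_dir="/tmp/dryrun"):
    """Return (dry_run, full_build) argv lists mirroring the shell commands above."""
    dry_run = [
        "trtllm-build", "--checkpoint_dir", checkpoint_dir,
        "--output_dir", dryrun_dir,
        "--dry_run", "--log_level", "debug",
    ]
    full_build = [
        "trtllm-build", "--checkpoint_dir", checkpoint_dir,
        "--output_dir", engine_dir,
        "--gemm_plugin", "nvfp4",
        "--max_batch_size", "4", "--max_input_len", "4096", "--max_seq_len", "5120",
        "--use_paged_context_fmha", "enable",
    ]
    return dry_run, full_build


def build_engine(run=subprocess.run, **kwargs):
    """Run the dry run first; the full build starts only if the dry run exits with 0."""
    for cmd in build_commands(**kwargs):
        print("+", shlex.join(cmd))
        run(cmd, check=True)  # subprocess.run raises CalledProcessError on failure
```

Calling `build_engine()` from the checkpoint directory runs both steps in order; the `run` parameter exists only so the command sequence can be exercised without TensorRT-LLM installed.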

Inference

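# NOTE: the MultimodalModelRunner constructor and generate() signature differ
# between TensorRT-LLM releases. Verify both calls against your installed version.
from tensorrt_llm.runtime import MultimodalModelRunner
from transformers import AutoProcessor
from PIL import Image

# The processor (tokenizer + image preprocessing) is loaded from this repo.
processor = AutoProcessor.from_pretrained("tss-deposium/PaddleOCR-VL-1.5-nvfp4")
# The engine directory is the --output_dir of the build step.
runner = MultimodalModelRunner.from_dir("./engine")

image = Image.open("page.png").convert("RGB")  # normalize to 3-channel RGB
prompt = "OCR with format:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda")
out = runner.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(out, skip_special_tokens=True)[0])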
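
The engine built above is limited to `--max_input_len 4096` and `--max_seq_len 5120`, so the number of tokens you can generate depends on the prompt length. As a rough sketch, the budget can be checked before calling the runner. `clamp_new_tokens` is a hypothetical helper, and it assumes `prompt_tokens` counts every input position, including any image placeholder tokens the processor inserts.

```python
MAX_INPUT_LEN = 4096  # --max_input_len from the build command
MAX_SEQ_LEN = 5120    # --max_seq_len from the build command


def clamp_new_tokens(prompt_tokens, requested_new_tokens,
                     max_input_len=MAX_INPUT_LEN, max_seq_len=MAX_SEQ_LEN):
    """Return the largest max_new_tokens value that still fits in the engine."""
    if prompt_tokens > max_input_len:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, engine accepts at most {max_input_len}"
        )
    # Input and output share the sequence budget: prompt + new tokens <= max_seq_len.
    return min(requested_new_tokens, max_seq_len - prompt_tokens)
```

With these limits, a full 4096-token prompt leaves room for 5120 - 4096 = 1024 generated tokens, so the `max_new_tokens=512` used above always fits.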

Caveats — read before adopting

Blackwell required: NVFP4 is hardware-accelerated only on RTX 50xx, B100/B200, GB200. On older GPUs, FP4 ops fall back to FP16 simulation, losing the speedup.
Multimodal NVFP4 is experimental: as of 2026-05, PaddleOCR-VL is not in NVIDIA's official NVFP4 support matrix. The vision tower may have been left in FP16, depending on modelopt's adapter; check the export logs in the source notebook. Future modelopt or trtllm-build releases may also break compatibility with this checkpoint.
OCR quality: tested on synthetic data plus ~30 multi-domain calibration samples. If your input distribution differs significantly (medical forms, handwriting, non-Latin scripts beyond those in the calibration corpus), recalibrate from the FP16 model with your own corpus.
Format stability: the NVFP4 checkpoint format may change between modelopt minor releases. Pin your modelopt version to the one used in this checkpoint's source notebook (see Provenance below).
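
One way to check whether the vision tower was left in FP16 without reading the export logs is to look at the tensor dtypes recorded in the safetensors header. The sketch below uses only the standard library and relies on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header). The function names are hypothetical, and grouping by the first component of the tensor name is an assumption about how this checkpoint names its modules.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Read only the JSON header of a .safetensors file (no tensor data is loaded)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        return json.loads(f.read(header_len))


def dtype_counts_by_prefix(header, depth=1):
    """Count tensors per dtype, grouped by the first `depth` parts of the tensor name."""
    counts = {}
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry, not a tensor
            continue
        prefix = ".".join(name.split(".")[:depth])
        counts.setdefault(prefix, Counter())[info["dtype"]] += 1
    return counts

# Example usage, run from the checkpoint directory:
#   header = read_safetensors_header("model.safetensors")
#   for prefix, counts in dtype_counts_by_prefix(header).items():
#       print(prefix, dict(counts))
```

A group that reports only `F16` tensors was most likely left unquantized. How packed FP4 weights and their scales are labelled depends on the exporter, so treat the output as a hint, not as proof.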

When to use this vs LFM2.5-VL-450M-ONNX

Criterion	This repo (NVFP4)	LFM2.5-VL-450M-ONNX
Hardware	Blackwell only	Any GPU + CPU
Stack	TensorRT-LLM	ONNX Runtime
Speed	~1.5-3× faster (decoder) on Blackwell	baseline
OCR coverage	PaddleOCR-VL training (broad doc layouts, FR/EN/multilingual)	LFM2.5 training (image captioning + OCR as a side task)
Portability	Self-hosted RTX 50xx only	Linux/Docker/Railway/cross-OS
Quality	~97-99% on calibration corpus	Proven (LiquidAI bench)

If you're not on Blackwell, or you need cross-platform / Railway deployment, use LFM2.5-VL-450M-ONNX (deposium-inference path).
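
The choice between the two paths can be automated at startup. The sketch below is illustrative only: the function names and backend labels are hypothetical, and the threshold rests on the assumption that Blackwell GPUs report a CUDA compute capability with major version 10 or higher (older generations such as Ada and Hopper report 8.x and 9.x). Verify the value your GPU reports before relying on it.

```python
def pick_backend(capability):
    """Choose a backend from a (major, minor) compute capability, or None if no GPU."""
    # Assumption: major >= 10 means Blackwell or newer, i.e. native NVFP4 support.
    if capability is not None and capability[0] >= 10:
        return "nvfp4-trtllm"
    return "lfm2.5-vl-onnx"


def detect_capability():
    """Return the compute capability of GPU 0, or None without PyTorch or CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)
```

`pick_backend(detect_capability())` then returns `"lfm2.5-vl-onnx"` on CPU-only hosts and pre-Blackwell GPUs, which matches the recommendation above.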

Provenance

Author: Nicolas Geysse — The Seed Ship (Deposium project, theseedship/deposium-turbov3)
Source model: PaddlePaddle/PaddleOCR-VL-1.5 (Apache-2.0)
Quantization: nvidia-modelopt NVFP4_DEFAULT_CFG
Pipeline: docs/paddleocr_vl_1_5_nvfp4_modelopt_export.ipynb
License: Apache-2.0 (inherited from base model)
