IBM Granite Speech 4.1 (2B): ONNX export for Vernacula

ONNX export of ibm-granite/granite-speech-4.1-2b, a speech-to-text model combining a 16-layer Conformer acoustic encoder, a BLIP-2 Q-Former projector, and a 1.84B-parameter Granite-4.0 LLM decoder. Packaged for use with Vernacula and ONNX Runtime.

Highlights

  • 23× realtime on RTX 3090 (RTF 0.043) on a 10-min English clip with VAD-segmented input. Achieved via VRAM-budgeted batching (B ≤ 16, cudaMemGetInfo-driven), the chained Run-with-OrtValue decode pattern that keeps KV GPU-resident across steps without OrtIoBinding, and a per-row periodic-loop detector that forces EOS on runaway rows so stragglers do not drag finished batch-mates to the max_new_tokens cap.
  • Unified prefill+step decoder graph. Prefill is signalled by passing zero-length past-KV inputs and cache_position=[0..S-1]; step is signalled by populated past-KV and cache_position=[past_len]. One graph, one set of weights: this eliminates the duplicate 7 GB LM copy that an init/step split would require, while preserving exact fp32 parity with the upstream forward.
  • Encoder rewritten to full attention. The upstream block-attention encoder cannot be made symbolic for ONNX export. The exporter replaces it with a mathematically equivalent full-attention pass guarded by a regression test (test_encoder_math_equivalence.py): max-abs-diff vs the original block path is ~1e-5 (fp32 noise).
  • Three trace-time patches for torch.onnx.export(dynamo=True): 5-D SDPA replaced with manual softmax-attention math, attn_implementation="eager" forced through the LM, and the audio-token masked_scatter replaced with a cumsum + gather + where pattern that drops the need for an explicit audio mask and keeps the merge symbolic.
  • fp32 throughout. An fp16 LM lost ~1 token of greedy parity on long contexts. Decoder unification recovered the memory savings fp16 would have offered by halving resident weight memory, so fp32 is the shipping precision.
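The cumsum + gather + where merge mentioned in the trace-time patches can be sketched in a few lines of NumPy. This is a hedged illustration of the idea only, not the exported graph: the function name, the 2-D shapes, and the clipping detail are assumptions.

```python
import numpy as np

def merge_audio_embeds(text_embeds, token_ids, audio_embeds, audio_token_id):
    """Symbolic-friendly replacement for masked_scatter (illustrative sketch).

    text_embeds:  (S, D) text embeddings for the prompt
    token_ids:    (S,)   input ids; audio placeholder positions hold audio_token_id
    audio_embeds: (A, D) projector outputs, consumed in order
    """
    is_audio = token_ids == audio_token_id            # (S,) bool mask
    # Running count of audio slots gives each slot its index into audio_embeds.
    slot = np.cumsum(is_audio) - 1                    # -1 before the first slot
    slot = np.clip(slot, 0, audio_embeds.shape[0] - 1)  # keep the gather in-bounds
    gathered = audio_embeds[slot]                     # (S, D)
    # Audio positions take the gathered rows; everything else keeps its text embed.
    return np.where(is_audio[:, None], gathered, text_embeds)
```

Unlike masked_scatter, every op here (cumsum, gather, where) has a static output shape, which is what keeps the merge traceable.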

Contents

| File | Purpose |
|------|---------|
| mel.onnx | Log-Mel spectrogram frontend (16 kHz waveform → mel features) |
| encoder.onnx (+ .data) | Conformer acoustic encoder, full-attention rewrite |
| projector.onnx | BLIP-2 Q-Former audio-to-text projector (encoder hidden → 2048-D audio embeds) |
| decoder.onnx (+ .data) | Unified prefill + step Granite-4.0 LM decoder with cumsum-gather audio merge |
| tokenizer.json, vocab.json, merges.txt, added_tokens.json, special_tokens_map.json, tokenizer_config.json | GPT-2-family ByteLevel BPE tokenizer assets |
| chat_template.jinja | Chat template (used to construct the ASR prompt) |
| preprocessor_config.json, processor_config.json | Mel/preprocessor parameters from upstream |
| export-report.json | Per-stage export timings + dtype/opset metadata |
| manifest.json | Per-file MD5 hashes (used by Vernacula's download verifier) |

Export provenance

Exported via scripts/granite_export/ in the Vernacula repo using the dynamo-based torch.onnx.export path at opset 18, dtype fp32. The decoder is a single graph that handles both prefill and step via empty vs populated past-KV inputs; runtime sees the same weights for the whole autoregressive trajectory.
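The empty-vs-populated signalling reduces to one helper for the cache_position input. This is an illustrative sketch of the contract described above; the actual tensor names and shapes are documented in scripts/granite_export/.

```python
import numpy as np

def decoder_cache_position(past_len: int, new_tokens: int) -> np.ndarray:
    """cache_position for the unified prefill+step decoder graph.

    Prefill: past_len == 0, new_tokens == S  ->  [0 .. S-1],
             passed alongside zero-length past-KV inputs.
    Step:    past_len  > 0, new_tokens == 1  ->  [past_len],
             passed alongside the populated past-KV from the previous Run.
    """
    return np.arange(past_len, past_len + new_tokens, dtype=np.int64)
```

The same graph and weights serve both calls; only the lengths of past-KV and cache_position distinguish them.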

The export and parity work, including the three trace-time patches and the encoder full-attention rewrite, is logged step-by-step in docs/dev/granite_speech_investigation.md; the perf iteration loop (chained Run-with-OrtValue, batching, runaway detection) is in docs/dev/granite_speech_perf_investigation.md.

Performance

Measured on a local RTX 3090 with Vernacula's batched ONNX pipeline:

| Workload | Mode | Wall | ASR RTF |
|----------|------|------|---------|
| 6.4 s VCTK clip | B=1 | 3.25 s | 0.508 |
| 25 s, 4 VCTK speakers | B=4 | 3.66 s | 0.147 |
| 90 s VAD-segmented | mixed B (≤16) | 7.80 s | 0.091 |
| 600 s (10 min) VAD-segmented, 159 segs | mixed B (≤16) | 23.9 s | 0.043 |

For comparison, on the same 600 s clip Vernacula's serial Qwen3-ASR-1.7B backend lands at RTF ≈ 0.075; Granite Speech 4.1 batched is ~1.7× faster end-to-end.
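RTF here is processing wall time divided by audio duration, so values below 1.0 are faster than realtime; the first table row reproduces directly from its Wall column:

```python
def realtime_factor(wall_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: seconds of processing per second of audio."""
    return wall_seconds / audio_seconds

# First table row: a 6.4 s VCTK clip transcribed in 3.25 s of wall time.
print(round(realtime_factor(3.25, 6.4), 3))  # 0.508
```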

License

Apache-2.0, inherited from the upstream ibm-granite/granite-speech-4.1-2b model. Permissive: commercial use is allowed with attribution.

Using these files

In Vernacula, select Granite Speech as the ASR backend (--asr granite on the CLI) and the package will be downloaded and verified against manifest.json automatically. Outside Vernacula, pull with huggingface_hub and load the four graphs with onnxruntime:

```python
from huggingface_hub import snapshot_download

path = snapshot_download(repo_id="christopherthompson81/granite-speech-4-1-2b-onnx")
```

The unified decoder follows a single contract for both prefill and step; see scripts/granite_export/ for the input/output tensor names, the audio-token-count formula (mel_length / 2 → projector blocks → 3 audio tokens / block), and the ASR prompt template.
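The audio-token-count formula can be sketched as below. The projector block size is an assumption here, used only as a placeholder; read the real value from the configs in scripts/granite_export/.

```python
import math

def audio_token_count(mel_frames: int, block_size: int = 15) -> int:
    """Tokens the prompt reserves for audio, per the formula above:
    mel frames are downsampled by 2 in the encoder, encoder frames are
    grouped into projector blocks, and each block yields 3 audio tokens.
    block_size=15 is a placeholder, NOT the exported model's value.
    """
    encoder_frames = mel_frames // 2
    blocks = math.ceil(encoder_frames / block_size)
    return blocks * 3
```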

Limitations

Numerical behavior matches the upstream ibm-granite/granite-speech-4.1-2b model. Language coverage (English, French, German, Spanish, Portuguese, Japanese), accuracy, and known failure modes inherit from the upstream model card.

Specific to this ONNX repackaging:

  • Greedy-decode loops on near-silence segments. On very short or near-silent VAD chunks the model can fall into a fixed short-period token loop ("uh, uh, uh, …"). Vernacula mitigates this with a runtime detector that forces EOS on the affected row after 3 cycles of any period in [1..4]. Without that mitigation, a single runaway row drags the whole batch to max_new_tokens. If you call the ONNX decoder directly, replicate the detector or pre-filter out chunks under ~0.3 s of speech.
  • No speaker attribution or word-level timestamps. Those features are in the upstream granite-speech-4.1-2b-plus variant, which requires a different output graph and is not exported here yet.
  • Segments capped at ~6.7 minutes of audio per call by the decoder's 4096-token context (with max_new_tokens=256). Pre-segment longer audio with VAD or diarization.
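If you do replicate the runaway detector, the period check described above amounts to something like the following sketch. The function name and exact tail-matching rule are assumptions, not Vernacula's implementation.

```python
def runaway_period(tokens, max_period=4, cycles=3):
    """Return the shortest period p in [1..max_period] such that the last
    p * cycles tokens are `cycles` exact repeats of one p-token pattern,
    or None if no such loop exists. The caller forces EOS on any row
    where this fires, so one looping row cannot stall the batch.
    """
    for p in range(1, max_period + 1):
        n = p * cycles
        if len(tokens) < n:
            continue  # not enough history to see `cycles` repeats yet
        tail = tokens[-n:]
        pattern = tail[:p]
        if all(tail[i] == pattern[i % p] for i in range(n)):
            return p
    return None
```

Run it on each row after every decode step; a non-None result means the row should be forced to EOS.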

Citation

@misc{granite-speech-4-1,
    author       = { IBM Granite Team },
    title        = { Granite Speech 4.1 (2B) },
    year         = 2025,
    url          = { https://huggingface.co/ibm-granite/granite-speech-4.1-2b },
    publisher    = { Hugging Face }
}

Support

Issues with the ONNX export specifically: open an issue on the Vernacula repo. Issues with the underlying model: see the upstream ibm-granite/granite-speech-4.1-2b model card.
