MarbleNet VAD -- GGUF

GGUF conversion of nvidia/Frame_VAD_Multilingual_MarbleNet_v2.0 for use with CrispStrobe/CrispASR.

Available variants

| File | Size | Notes |
| --- | --- | --- |
| marblenet-vad.gguf | 439 KB | F32, BatchNorm fused into conv weights |

No quantization is needed: the model has only 91.5K parameters (439 KB).

Model details

  • Architecture: MarbleNet, a 1D time-channel separable CNN (6 Jasper blocks: depthwise conv + pointwise conv + BN + ReLU)
  • Parameters: 91.5K (smallest VAD model in CrispASR)
  • Languages: Chinese, English, French, German, Russian, Spanish
  • Input: 80-bin mel spectrogram (16 kHz, 512-point FFT, 25 ms window, 10 ms stride)
  • Output: per-frame speech probability (20 ms per frame)
  • Training data: 2,600h real + 1,000h synthetic + 330h noise
  • License: NVIDIA Open Model License (commercial use OK)
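As an illustration of the input and output figures above, the following minimal sketch converts the window and stride to sample counts and turns per-frame speech probabilities into speech segments. It is not part of CrispASR: the function names and the 0.5 threshold are assumptions chosen for the example; only the 16 kHz rate, the 25 ms / 10 ms framing, and the 20 ms output frame come from this card.

```python
from typing import List, Sequence, Tuple

SAMPLE_RATE_HZ = 16000  # from the model card
FRAME_MS = 20           # from the model card: one probability per 20 ms


def ms_to_samples(ms: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Convert a duration in milliseconds to a number of samples."""
    return sample_rate_hz * ms // 1000


def probs_to_segments(
    probs: Sequence[float],
    threshold: float = 0.5,  # assumed value, not taken from the model card
    frame_ms: int = FRAME_MS,
) -> List[Tuple[int, int]]:
    """Group consecutive frames at or above the threshold into
    (start_ms, end_ms) speech segments."""
    segments: List[Tuple[int, int]] = []
    start = None  # index of the first frame of the current segment
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            segments.append((start * frame_ms, i * frame_ms))
            start = None
    if start is not None:  # speech runs to the end of the input
        segments.append((start * frame_ms, len(probs) * frame_ms))
    return segments


if __name__ == "__main__":
    print(ms_to_samples(25), ms_to_samples(10))
    print(probs_to_segments([0.1, 0.9, 0.8, 0.2, 0.7]))
```

A production pipeline would typically also smooth the probabilities and merge or pad short segments; that is left out here to keep the sketch small.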

Benchmark (ROC-AUC from NVIDIA)

| Dataset | ROC-AUC (%) |
| --- | --- |
| VoxConverse-test | 96.65 |
| VoxConverse-dev | 97.59 |
| AMI-test | 96.25 |
| Earnings21 | 97.11 |
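For readers unfamiliar with the metric, ROC-AUC can be read as the probability that a randomly chosen speech frame receives a higher score than a randomly chosen non-speech frame. The sketch below computes it by direct pairwise comparison on toy data. It only illustrates the definition and is not NVIDIA's evaluation code; the function name and the example values are made up.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC via pairwise comparison: the fraction of (speech, non-speech)
    pairs in which the speech frame scores higher; ties count as half."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data: 1 = speech frame, 0 = non-speech frame.
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(f"{100 * roc_auc(labels, scores):.2f}")  # percentage, as in the table
```

The pairwise form is quadratic in the number of frames, so it suits small examples only; rank-based implementations compute the same value more efficiently.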

Usage with CrispASR

# Auto-download (439 KB)
crispasr --backend whisper -m auto --auto-download --vad -vm marblenet -f audio.wav

# Or with explicit path
crispasr --backend parakeet -m auto --auto-download --vad -vm marblenet-vad.gguf -f audio.wav

CrispASR VAD comparison

| VAD | Model Size | Latency | Languages |
| --- | --- | --- | --- |
| Silero VAD v5 | 0.9 MB | ~60 ms | Multilingual |
| MarbleNet | 0.4 MB | ~30 ms | 6 languages |
| FireRedVAD | 2.4 MB | ~50 ms | 100+ (recommended) |
| Whisper-VAD-ASMR | 22 MB | ~1000 ms | Experimental |

Conversion

python models/convert-marblenet-vad-to-gguf.py \
  --input nvidia/Frame_VAD_Multilingual_MarbleNet_v2.0 \
  --output marblenet-vad.gguf

BatchNorm layers are fused into the convolution weights at conversion time, which reduces the tensor count from 84 in the original checkpoint to 36 in the GGUF file.
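The fusion relies on the fact that inference-time BatchNorm is a per-channel affine transform, so it can be folded into the preceding convolution's weights and bias. The sketch below shows the standard folding formula on plain Python lists. It is not the converter script's actual code: the function name, the tensor layout (one flat kernel per output channel), and the `eps` default are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def fuse_batchnorm(
    weight: Sequence[Sequence[float]],  # one flattened kernel per output channel
    bias: Sequence[float],              # conv bias per output channel
    gamma: Sequence[float],             # BatchNorm scale
    beta: Sequence[float],              # BatchNorm shift
    mean: Sequence[float],              # BatchNorm running mean
    var: Sequence[float],               # BatchNorm running variance
    eps: float = 1e-5,                  # assumed; must match the trained model
) -> Tuple[List[List[float]], List[float]]:
    """Fold BN(conv(x)) into a single convolution.

    With scale = gamma / sqrt(var + eps):
        fused_weight = weight * scale
        fused_bias   = (bias - mean) * scale + beta
    """
    fused_w: List[List[float]] = []
    fused_b: List[float] = []
    for w_c, b_c, g, bt, m, v in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        fused_w.append([w * scale for w in w_c])
        fused_b.append((b_c - m) * scale + bt)
    return fused_w, fused_b
```

After folding, the four BatchNorm tensors of each layer (scale, shift, running mean, running variance) no longer need to be stored, which is consistent with the lower tensor count in the converted file.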
