Bud-E Wake Word Models ("Hey Buddy")

Wake word detection models for the phrase "Hey Buddy", trained using the livekit-wakeword toolkit. These models are designed for the Bud-E voice assistant project.

Models

Eight models are provided: four sizes (tiny, small, medium, large) in each of two languages (English, German).

English Models

| Model | Size | AUT | FPPH | Recall@0.5 | Optimal Recall | Optimal Threshold | ONNX Size |
|---|---|---|---|---|---|---|---|
| en_tiny | 16d, 1 block | 0.0087 | 0.00 | 39.6% | 66.1% @ 0.32 | 0.32 | 119 KB |
| en_small | 32d, 1 block | 0.0067 | 0.00 | 61.8% | 73.6% @ 0.36 | 0.36 | 163 KB |
| en_medium | 128d, 2 blocks | 0.0062 | 0.54 | 86.8% | 79.6% @ 0.76 | 0.76 | 933 KB |
| en_large | 256d, 3 blocks | 0.0038 | 1.03 | 92.3% | 83.7% @ 0.88 | 0.88 | 3.8 MB |

German Models

| Model | Size | AUT | FPPH | Recall@0.5 | Optimal Recall | Optimal Threshold | ONNX Size |
|---|---|---|---|---|---|---|---|
| de_tiny | 16d, 1 block | 0.0111 | 0.00 | 25.0% | 63.4% @ 0.24 | 0.24 | 119 KB |
| de_small | 32d, 1 block | 0.0097 | 0.00 | 47.6% | 67.7% @ 0.30 | 0.30 | 163 KB |
| de_medium | 128d, 2 blocks | 0.0060 | 1.11 | 82.6% | 74.5% @ 0.79 | 0.79 | 933 KB |
| de_large | 256d, 3 blocks | 0.0066 | 2.55 | 89.2% | 51.3% @ 0.95 | 0.95 | 3.8 MB |

Metrics:

  • AUT (Area Under DET curve): Lower is better. Measures overall detection quality.
  • FPPH (False Positives Per Hour): Lower is better. Measured at threshold=0.5.
  • Recall@0.5: Higher is better. Fraction of true wake words detected at threshold=0.5.
  • Optimal Threshold: Threshold that maximizes recall while keeping FPPH below 0.1.
  • Optimal Recall: Recall at the optimal threshold.
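
The threshold selection described above can be sketched as a simple sweep. This is a minimal illustration, not the evaluation code used for these models: the function names, the 0.01 step size, and the toy scores are all assumptions made for the example.

```python
import numpy as np

def recall_at(pos_scores, threshold):
    """Fraction of positive clips whose score exceeds the threshold."""
    return float(np.mean(np.asarray(pos_scores) > threshold))

def fpph_at(neg_scores, threshold, hours):
    """False positives per hour of negative audio at the threshold."""
    return float(np.sum(np.asarray(neg_scores) > threshold)) / hours

def optimal_threshold(pos_scores, neg_scores, hours, max_fpph=0.1):
    """Return (threshold, recall) with the highest recall subject to FPPH < max_fpph.

    Returns None if no threshold on the sweep grid satisfies the constraint.
    """
    best = None
    for t in np.round(np.arange(0.01, 1.00, 0.01), 2):
        if fpph_at(neg_scores, t, hours) < max_fpph:
            r = recall_at(pos_scores, t)
            if best is None or r > best[1]:
                best = (float(t), r)
    return best

# Toy scores, invented for illustration (not taken from the evaluation set).
pos = [0.95, 0.80, 0.60, 0.40, 0.20]   # scores on true wake-word clips
neg = [0.70, 0.30, 0.10, 0.05]         # scores on 10 hours of negative audio

print("Recall@0.5:", recall_at(pos, 0.5))
print("FPPH@0.5:", fpph_at(neg, 0.5, hours=10.0))
print("Optimal (threshold, recall):", optimal_threshold(pos, neg, hours=10.0))
```

Because recall can only fall as the threshold rises, the selected threshold is the lowest one on the grid that meets the FPPH limit.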

Architecture

All models use the conv_attention classifier architecture from livekit-wakeword:

  • Input: Pre-extracted speech embeddings of shape (batch, 16, 96) from the frozen Google speech_embedding model
  • Architecture: Conv1D layers + Multi-head Attention + Mean Pooling + Linear head + Sigmoid
  • Output: Confidence score in [0, 1]
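
To make the tensor shapes concrete, here is a rough NumPy forward pass through the same sequence of stages. It is a shape walk-through under stated assumptions, not the livekit-wakeword implementation: the weights are random, and the single convolution layer, kernel size 3, ReLU activation, single attention head, and residual connection are all choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_same(x, w, b):
    """1-D convolution over time with zero padding that keeps the length.
    x: (batch, time, c_in), w: (kernel, c_in, c_out), b: (c_out,)"""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0)))
    out = np.stack(
        [np.einsum("bkc,kcd->bd", xp[:, t:t + k], w) for t in range(x.shape[1])],
        axis=1,
    )
    return out + b

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the time axis."""
    q, k, v = x @ wq, x @ wk, x @ wv
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def init_params(d=16, kernel=3, in_dim=96, scale=0.1):
    """Random weights for a hypothetical model of width d (e.g. 16 for 'tiny')."""
    return {
        "conv_w": rng.normal(0, scale, (kernel, in_dim, d)),
        "conv_b": np.zeros(d),
        "wq": rng.normal(0, scale, (d, d)),
        "wk": rng.normal(0, scale, (d, d)),
        "wv": rng.normal(0, scale, (d, d)),
        "head_w": rng.normal(0, scale, (d, 1)),
        "head_b": np.zeros(1),
    }

def forward(x, p):
    h = np.maximum(conv1d_same(x, p["conv_w"], p["conv_b"]), 0.0)  # (batch, 16, d)
    h = h + self_attention(h, p["wq"], p["wk"], p["wv"])           # (batch, 16, d)
    pooled = h.mean(axis=1)                                        # (batch, d)
    logit = pooled @ p["head_w"] + p["head_b"]                     # (batch, 1)
    return 1.0 / (1.0 + np.exp(-logit))                            # sigmoid

embeddings = rng.normal(size=(2, 16, 96))
scores = forward(embeddings, init_params(d=16))
print(scores.shape)
```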

The full inference pipeline is:

  1. Audio (16 kHz mono) → Mel spectrogram (ONNX frontend) → Speech embeddings (N, 96) (ONNX encoder) → Pad/truncate to (16, 96) → Classifier (this model) → Score [0, 1]

The mel spectrogram and speech embedding ONNX models are bundled with the livekit-wakeword package in resources/.

Usage

With livekit-wakeword (Recommended)

pip install livekit-wakeword

from livekit.wakeword import WakeWordDetector
import numpy as np

# Load model
detector = WakeWordDetector.from_pretrained("laion/bud-e_wakeword-models_livekit-wakeword", model_name="en_large")

# Process audio (16 kHz, mono, float32)
# Random noise stands in for 2 seconds of audio here; pass real audio in practice
audio = np.random.randn(32000).astype(np.float32)
score = detector.detect(audio)
print(f"Wake word confidence: {score:.3f}")

# Use optimal threshold from training
if score > 0.88:  # optimal_threshold for en_large
    print("Wake word detected!")
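
Since each model has its own optimal threshold, it can help to keep them in one lookup rather than hard-coding a number. The sketch below copies the thresholds from the tables above; the names `OPTIMAL_THRESHOLDS` and `is_wake_word` are hypothetical helpers for this example and are not part of livekit-wakeword.

```python
# Optimal thresholds copied from the model tables in this README.
OPTIMAL_THRESHOLDS = {
    "en_tiny": 0.32, "en_small": 0.36, "en_medium": 0.76, "en_large": 0.88,
    "de_tiny": 0.24, "de_small": 0.30, "de_medium": 0.79, "de_large": 0.95,
}

def is_wake_word(score: float, model_name: str) -> bool:
    """Return True if the score exceeds the optimal threshold of the given model."""
    return score > OPTIMAL_THRESHOLDS[model_name]

# The same score can trigger one model and not another.
print(is_wake_word(0.90, "en_large"))
print(is_wake_word(0.90, "de_large"))
```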

Direct ONNX Inference

import onnxruntime as ort
import numpy as np

# Load classifier
session = ort.InferenceSession("en_large/hey_buddy_en_large.onnx")

# Input: pre-extracted speech embeddings (batch, 16, 96)
# Random values stand in for real embeddings here
embeddings = np.random.randn(1, 16, 96).astype(np.float32)

# Run inference
score = session.run(["score"], {"embeddings": embeddings})[0]
print(f"Score: {score[0, 0]:.4f}")

Full Pipeline (Manual)

For custom integration without the livekit-wakeword package:

import onnxruntime as ort
import numpy as np
import librosa

# Load pipeline models (from livekit-wakeword resources/)
mel_session = ort.InferenceSession("melspectrogram.onnx")
embed_session = ort.InferenceSession("speech_embedding.onnx")
classifier_session = ort.InferenceSession("en_large/hey_buddy_en_large.onnx")

# 1. Load audio
audio, sr = librosa.load("audio.wav", sr=16000, mono=True)

# 2. Compute mel spectrogram
mel = mel_session.run(None, {"audio": audio.reshape(1, -1)})[0]

# 3. Extract embeddings
embeddings = embed_session.run(None, {"mel": mel})[0]  # (N, 96)

# 4. Pad/truncate to (16, 96)
if embeddings.shape[0] >= 16:
    embeddings = embeddings[-16:]
else:
    pad = np.zeros((16 - embeddings.shape[0], 96), dtype=np.float32)
    embeddings = np.concatenate([pad, embeddings])

# 5. Run classifier
score = classifier_session.run(
    ["score"],
    {"embeddings": embeddings.reshape(1, 16, 96)}
)[0][0, 0]

print(f"Wake word score: {score:.4f}")

Training Data

Models were trained on synthetic speech generated with multiple TTS backends (three for English, two for German) to increase voice diversity:

English

  • Piper VITS (en_US-lessac-medium): 4,000 positive train / 800 test (904 speaker voices with SLERP blending)
  • VoxCPM2: 3,000 positive train / 600 test (29 voice design prompts x 4 CFG values x 3 timestep configs)
  • ChatterboxTTS: 2,000 positive train / 400 test (8 reference voices with varying exaggeration/temperature)
  • Adversarial negatives: 4,000 train / 800 test (phonetically similar phrases such as "hey body", "hey bunny", "hey baby")
  • Background noise: 1,000 train / 200 test (from MUSAN)
  • General negatives: ACAV100M ~2000 hrs pre-extracted speech features

German

  • VoxCPM2: 3,000 positive train / 600 test
  • ChatterboxTTS: 2,000 positive train / 400 test
  • Adversarial negatives: 3,000 train / 600 test
  • Background noise: 1,000 train / 200 test
  • General negatives: ACAV100M ~2000 hrs
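
As a quick check on the totals implied by these lists, the short script below sums the positive clip counts per language and multiplies out the English VoxCPM2 settings grid. The numbers are copied from the lists above; the variable and function names are only for this example.

```python
# Positive clip counts as (train, test), copied from the lists above.
POSITIVES = {
    "en": {"Piper VITS": (4000, 800), "VoxCPM2": (3000, 600), "ChatterboxTTS": (2000, 400)},
    "de": {"VoxCPM2": (3000, 600), "ChatterboxTTS": (2000, 400)},
}

def totals(lang):
    """Sum train and test positives across all TTS backends for one language."""
    train = sum(tr for tr, _ in POSITIVES[lang].values())
    test = sum(te for _, te in POSITIVES[lang].values())
    return train, test

# English VoxCPM2 grid: voice design prompts x CFG values x timestep configs.
voxcpm2_combinations = 29 * 4 * 3

print(totals("en"))           # (9000, 1800)
print(totals("de"))           # (5000, 1000)
print(voxcpm2_combinations)   # 348
```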

Augmentation

  • 3 rounds of compounding augmentation per clip:
    • 7-band parametric EQ (25% probability)
    • Tanh distortion (25% probability)
    • Room impulse response convolution (50% probability, MIT RIRs)
    • Background noise mixing (SNR 5-15 dB)
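
The sketch below shows how a compounding chain of this kind might look in NumPy. It is not the livekit-wakeword augmentation code: the parametric EQ step is omitted for brevity, the distortion gain is an arbitrary value, the impulse response is synthetic rather than taken from the MIT set, and noise is mixed in every round because the list above gives no probability for that step.

```python
import numpy as np

def tanh_distortion(audio, gain=4.0):
    """Soft-clip the waveform; the gain value is illustrative only."""
    return np.tanh(gain * audio) / np.tanh(gain)

def apply_rir(audio, rir):
    """Convolve with a room impulse response and keep the original length."""
    return np.convolve(audio, rir)[: len(audio)]

def mix_noise(audio, noise, snr_db):
    """Scale the noise so the mix has the requested signal-to-noise ratio."""
    sig_power = np.mean(audio ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return audio + scale * noise

def augment(audio, rir, noise, rng, rounds=3):
    """Run the chain `rounds` times, each round operating on the previous output."""
    for _ in range(rounds):
        if rng.random() < 0.25:                      # tanh distortion, 25%
            audio = tanh_distortion(audio)
        if rng.random() < 0.50:                      # RIR convolution, 50%
            audio = apply_rir(audio, rir)
        audio = mix_noise(audio, noise, rng.uniform(5.0, 15.0))  # SNR 5-15 dB
    return audio.astype(np.float32)

rng = np.random.default_rng(0)
clip = 0.1 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s test tone
rir = np.exp(-np.arange(800) / 100.0)                            # synthetic decay
rir /= rir.sum()
noise = rng.normal(0, 0.05, 16000)

out = augment(clip, rir, noise, rng)
print(out.shape, out.dtype)
```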

Training

  • 3-phase adaptive training with focal loss (gamma=2.0)
  • Embedding mixup regularization (alpha=0.2)
  • Label smoothing (epsilon=0.05)
  • Cosine warmup + decay learning rate schedule
  • Negative class weight ramp from 1 to 3000
  • Checkpoint averaging over best validation checkpoints
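
For readers unfamiliar with these ingredients, the following NumPy sketch shows one common formulation of each. It is not the training code: the label smoothing convention, the linear warmup shape, and the decay to zero are assumptions, and `neg_weight` is a fixed value here, whereas the list above describes ramping it from 1 to 3000 during training.

```python
import math
import numpy as np

def smooth_labels(y, eps=0.05):
    """Binary label smoothing (one common convention): 1 -> 1 - eps/2, 0 -> eps/2."""
    return y * (1.0 - eps) + 0.5 * eps

def focal_bce(p, y, gamma=2.0, neg_weight=1.0):
    """Focal binary cross-entropy on probabilities p with (soft) targets y."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pos_term = -y * (1 - p) ** gamma * np.log(p)
    neg_term = -(1 - y) * p ** gamma * np.log(1 - p)
    return float(np.mean(pos_term + neg_weight * neg_term))

def mixup(x, y, rng, alpha=0.2):
    """Blend each example with a random partner using lam ~ Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16, 96))                     # batch of embeddings
y = np.array([1, 0, 0, 1, 0, 0, 0, 1], dtype=float)  # wake word labels

x_mix, y_mix = mixup(x, smooth_labels(y), rng)
p = np.full(8, 0.5)                                   # stand-in model outputs
print("loss:", focal_bce(p, y_mix, gamma=2.0, neg_weight=3.0))
print("lr at step 0, 10, 100:", [lr_at(s, 100, 10, 1e-3) for s in (0, 10, 100)])
```

With gamma=0 and neg_weight=1 the loss reduces to ordinary binary cross-entropy, which is a convenient sanity check.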

File Structure

├── README.md
├── configs/                    # Training YAML configs
│   ├── hey_buddy_en_base.yaml  # EN data generation config
│   ├── hey_buddy_de_base.yaml  # DE data generation config
│   └── hey_buddy_{en,de}_{tiny,small,medium,large}.yaml
├── en_tiny/
│   ├── hey_buddy_en_tiny.onnx       # ONNX model (119 KB)
│   ├── hey_buddy_en_tiny.pt         # PyTorch state dict
│   ├── hey_buddy_en_tiny_eval.json  # Evaluation metrics
│   ├── hey_buddy_en_tiny_det.png    # DET curve plot
│   └── hey_buddy_en_tiny_metrics.json
├── en_small/
├── en_medium/
├── en_large/
├── de_tiny/
├── de_small/
├── de_medium/
└── de_large/
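
The naming pattern makes it easy to derive file paths for any model. The helper below is a hypothetical convenience for this README, and it assumes that every model folder follows the same layout shown for en_tiny.

```python
from pathlib import PurePosixPath

LANGUAGES = ("en", "de")
SIZES = ("tiny", "small", "medium", "large")

def model_files(lang, size):
    """Return repo-relative paths for one model, assuming the en_tiny layout."""
    if lang not in LANGUAGES or size not in SIZES:
        raise ValueError(f"unknown model: {lang}_{size}")
    stem = f"hey_buddy_{lang}_{size}"
    folder = PurePosixPath(f"{lang}_{size}")
    return {
        "onnx": str(folder / f"{stem}.onnx"),
        "state_dict": str(folder / f"{stem}.pt"),
        "eval": str(folder / f"{stem}_eval.json"),
        "det_plot": str(folder / f"{stem}_det.png"),
        "metrics": str(folder / f"{stem}_metrics.json"),
    }

print(model_files("en", "large")["onnx"])  # en_large/hey_buddy_en_large.onnx
```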

Recommended Models

  • For production/edge: en_medium or de_medium (best balance of recall vs false positive rate)
  • For quality-first: en_large or de_large (highest recall at threshold 0.5, but higher FPPH)
  • For resource-constrained: en_small or de_small (zero FPPH at threshold 0.5, moderate recall)

License

Apache 2.0

Citation

@misc{bude-wakeword-2026,
  title={Bud-E Wake Word Models},
  author={LAION},
  year={2026},
  url={https://huggingface.co/laion/bud-e_wakeword-models_livekit-wakeword}
}

Acknowledgments
