Bud-E Wake Word Models ("Hey Buddy")
Wake word detection models for the phrase "Hey Buddy", trained using the livekit-wakeword toolkit. These models are designed for the Bud-E voice assistant project.
Models
8 models are provided: 4 sizes (tiny, small, medium, large) x 2 languages (English, German).
English Models
| Model | Size | AUT | FPPH | Recall@0.5 | Optimal Recall | Optimal Threshold | ONNX Size |
|---|---|---|---|---|---|---|---|
| en_tiny | 16d, 1 block | 0.0087 | 0.00 | 39.6% | 66.1% @ 0.32 | 0.32 | 119 KB |
| en_small | 32d, 1 block | 0.0067 | 0.00 | 61.8% | 73.6% @ 0.36 | 0.36 | 163 KB |
| en_medium | 128d, 2 blocks | 0.0062 | 0.54 | 86.8% | 79.6% @ 0.76 | 0.76 | 933 KB |
| en_large | 256d, 3 blocks | 0.0038 | 1.03 | 92.3% | 83.7% @ 0.88 | 0.88 | 3.8 MB |
German Models
| Model | Size | AUT | FPPH | Recall@0.5 | Optimal Recall | Optimal Threshold | ONNX Size |
|---|---|---|---|---|---|---|---|
| de_tiny | 16d, 1 block | 0.0111 | 0.00 | 25.0% | 63.4% @ 0.24 | 0.24 | 119 KB |
| de_small | 32d, 1 block | 0.0097 | 0.00 | 47.6% | 67.7% @ 0.30 | 0.30 | 163 KB |
| de_medium | 128d, 2 blocks | 0.0060 | 1.11 | 82.6% | 74.5% @ 0.79 | 0.79 | 933 KB |
| de_large | 256d, 3 blocks | 0.0066 | 2.55 | 89.2% | 51.3% @ 0.95 | 0.95 | 3.8 MB |
Metrics:
- AUT (Area Under DET curve): Lower is better. Measures overall detection quality across thresholds.
- FPPH (False Positives Per Hour): Lower is better. Measured at threshold=0.5.
- Recall@0.5: Higher is better. Fraction of true wake words detected at threshold=0.5.
- Optimal Threshold: Threshold that maximizes recall while keeping FPPH below 0.1.
- Optimal Recall: Recall at the optimal threshold.
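The optimal-threshold rule can be illustrated with a short sketch. This is not the toolkit's evaluation code: the function name `pick_threshold`, the 0.01 threshold grid, and the convention that every negative clip scoring above the threshold counts as one false positive are all assumptions made for illustration.

```python
import numpy as np

def pick_threshold(pos_scores, neg_scores, neg_hours, max_fpph=0.1):
    """Return (threshold, recall, fpph) with the highest recall such that fpph < max_fpph.

    pos_scores: classifier scores on clips that contain the wake word.
    neg_scores: classifier scores on negative audio spanning `neg_hours` hours.
    Returns None if no threshold on the grid satisfies the FPPH limit.
    """
    pos = np.asarray(pos_scores, dtype=np.float64)
    neg = np.asarray(neg_scores, dtype=np.float64)
    best = None
    # Hypothetical grid: thresholds in steps of 0.01
    for t in np.round(np.arange(0.01, 1.00, 0.01), 2):
        fpph = np.sum(neg > t) / neg_hours   # false positives per hour
        recall = np.mean(pos > t)            # fraction of wake words detected
        if fpph < max_fpph and (best is None or recall > best[1]):
            best = (float(t), float(recall), float(fpph))
    return best

# Toy scores, not real evaluation data
result = pick_threshold(
    pos_scores=[0.9, 0.8, 0.6, 0.4, 0.2],
    neg_scores=[0.655, 0.3, 0.1, 0.05],
    neg_hours=10.0,
)
print(result)
```

Because recall can only fall as the threshold rises, the lowest threshold that meets the FPPH limit is also the one with the highest recall.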
Architecture
All models use the conv_attention classifier architecture from livekit-wakeword:
- Input: Pre-extracted speech embeddings of shape (batch, 16, 96) from the frozen Google speech_embedding model
- Architecture: Conv1D layers + Multi-head Attention + Mean Pooling + Linear head + Sigmoid
- Output: Confidence score in [0, 1]
The full inference pipeline is:
- Audio (16 kHz mono) → Mel spectrogram (ONNX frontend) → Speech embeddings (N, 96) (ONNX encoder) → Pad/truncate to (16, 96) → Classifier (this model) → Score [0, 1]
The mel spectrogram and speech embedding ONNX models are bundled with the livekit-wakeword package in resources/.
Usage
With livekit-wakeword (Recommended)
pip install livekit-wakeword
from livekit.wakeword import WakeWordDetector
import numpy as np
# Load model
detector = WakeWordDetector.from_pretrained("laion/bud-e_wakeword-models_livekit-wakeword", model_name="en_large")
# Process audio (16 kHz, mono, float32)
audio = np.random.randn(32000).astype(np.float32) # 2 seconds
score = detector.detect(audio)
print(f"Wake word confidence: {score:.3f}")
# Use optimal threshold from training
if score > 0.88:  # optimal_threshold for en_large
    print("Wake word detected!")
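Each model has its own optimal threshold, so hard-coding 0.88 only suits en_large. A minimal sketch of a per-model lookup follows; the values are copied from the tables in the Models section, while the names `OPTIMAL_THRESHOLDS` and `is_wake_word` are hypothetical helpers and not part of livekit-wakeword.

```python
# Optimal thresholds copied from the English and German model tables
OPTIMAL_THRESHOLDS = {
    "en_tiny": 0.32, "en_small": 0.36, "en_medium": 0.76, "en_large": 0.88,
    "de_tiny": 0.24, "de_small": 0.30, "de_medium": 0.79, "de_large": 0.95,
}

def is_wake_word(score: float, model_name: str) -> bool:
    """Compare a confidence score against the model's optimal threshold."""
    return score > OPTIMAL_THRESHOLDS[model_name]

print(is_wake_word(0.90, "en_large"))  # above 0.88
print(is_wake_word(0.90, "de_large"))  # below 0.95
```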
Direct ONNX Inference
import onnxruntime as ort
import numpy as np
# Load classifier
session = ort.InferenceSession("en_large/hey_buddy_en_large.onnx")
# Input: pre-extracted speech embeddings (batch, 16, 96)
embeddings = np.random.randn(1, 16, 96).astype(np.float32)
# Run inference
score = session.run(["score"], {"embeddings": embeddings})[0]
print(f"Score: {score[0, 0]:.4f}")
Full Pipeline (Manual)
For custom integration without the livekit-wakeword package:
import onnxruntime as ort
import numpy as np
import librosa
# Load pipeline models (from livekit-wakeword resources/)
mel_session = ort.InferenceSession("melspectrogram.onnx")
embed_session = ort.InferenceSession("speech_embedding.onnx")
classifier_session = ort.InferenceSession("en_large/hey_buddy_en_large.onnx")
# 1. Load audio
audio, sr = librosa.load("audio.wav", sr=16000, mono=True)
# 2. Compute mel spectrogram
mel = mel_session.run(None, {"audio": audio.reshape(1, -1)})[0]
# 3. Extract embeddings
embeddings = embed_session.run(None, {"mel": mel})[0] # (N, 96)
# 4. Pad/truncate to (16, 96)
if embeddings.shape[0] >= 16:
    # Keep the 16 most recent embedding frames
    embeddings = embeddings[-16:]
else:
    # Left-pad with zeros so the newest frames stay at the end
    pad = np.zeros((16 - embeddings.shape[0], 96), dtype=np.float32)
    embeddings = np.concatenate([pad, embeddings])
# 5. Run classifier
score = classifier_session.run(
    ["score"],
    {"embeddings": embeddings.reshape(1, 16, 96)}
)[0][0, 0]
print(f"Wake word score: {score:.4f}")
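The manual pipeline scores a single clip. For longer recordings, one option is to cut the audio into overlapping windows and score each window separately. The sketch below covers only the windowing step. The 2-second window mirrors the usage example earlier in this card, while the 0.25-second hop and the function name `sliding_windows` are assumptions rather than values taken from livekit-wakeword.

```python
import numpy as np

SAMPLE_RATE = 16_000  # the pipeline expects 16 kHz mono audio

def sliding_windows(audio, window_s=2.0, hop_s=0.25, sample_rate=SAMPLE_RATE):
    """Yield (start_time_s, chunk) pairs covering `audio` with overlapping windows.

    Input shorter than one window is left-padded with silence; in that case
    the start time is relative to the padded buffer.
    """
    window = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    audio = np.asarray(audio, dtype=np.float32)
    if len(audio) < window:
        padding = np.zeros(window - len(audio), dtype=np.float32)
        audio = np.concatenate([padding, audio])
    for start in range(0, len(audio) - window + 1, hop):
        yield start / sample_rate, audio[start:start + window]

# 3 seconds of silence as placeholder input
chunks = list(sliding_windows(np.zeros(3 * SAMPLE_RATE, dtype=np.float32)))
print(len(chunks), chunks[-1][0])
```

Each chunk can then go through steps 2 to 5 above, with the highest score (or the first score over the threshold) taken as the detection result.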
Training Data
Models were trained on synthetic speech generated with multiple TTS backends for speaker diversity: three for English and two for German.
English
- Piper VITS (en_US-lessac-medium): 4,000 positive train / 800 test; 904 speaker voices with SLERP blending
- VoxCPM2: 3,000 positive train / 600 test; 29 voice design prompts x 4 CFG values x 3 timestep configs
- ChatterboxTTS: 2,000 positive train / 400 test; 8 reference voices with varying exaggeration/temperature
- Adversarial negatives: 4,000 train / 800 test; phonetically similar phrases ("hey body", "hey bunny", "hey baby", etc.)
- Background noise: 1,000 train / 200 test; from MUSAN
- General negatives: ACAV100M ~2000 hrs pre-extracted speech features
German
- VoxCPM2: 3,000 positive train / 600 test
- ChatterboxTTS: 2,000 positive train / 400 test
- Adversarial negatives: 3,000 train / 600 test
- Background noise: 1,000 train / 200 test
- General negatives: ACAV100M ~2000 hrs
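As a quick sanity check, the per-language totals can be tallied from the counts listed above. The snippet only performs arithmetic on those published numbers; the variable names are illustrative.

```python
# (train, test) positive clip counts copied from the lists above
english_positive = {
    "Piper VITS": (4000, 800),
    "VoxCPM2": (3000, 600),
    "ChatterboxTTS": (2000, 400),
}
german_positive = {
    "VoxCPM2": (3000, 600),
    "ChatterboxTTS": (2000, 400),
}

def totals(sources):
    """Sum the train and test counts over all TTS backends."""
    train = sum(tr for tr, _ in sources.values())
    test = sum(te for _, te in sources.values())
    return train, test

en_train, en_test = totals(english_positive)
de_train, de_test = totals(german_positive)

# VoxCPM2 sampling grid: 29 prompts x 4 CFG values x 3 timestep configs
voxcpm2_configs = 29 * 4 * 3

print(f"EN positives: {en_train} train / {en_test} test")
print(f"DE positives: {de_train} train / {de_test} test")
print(f"VoxCPM2 configurations: {voxcpm2_configs}")
```

English therefore has 9,000 positive training clips against 4,000 adversarial negatives, while German has 5,000 against 3,000.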
Augmentation
- 3 rounds of compounding augmentation per clip:
  - 7-band parametric EQ (25% probability)
  - Tanh distortion (25% probability)
  - Room impulse response convolution (50% probability, MIT RIRs)
  - Background noise mixing (SNR 5-15 dB)
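Two of the augmentation steps above are simple enough to sketch in NumPy. This is an illustration, not the toolkit's implementation: the distortion gain of 4.0, applying noise mixing in every round, and the function names are assumptions, and the EQ and room impulse response steps are omitted.

```python
import numpy as np

def tanh_distortion(audio, gain=4.0):
    """Soft-clip the waveform; `gain` is an illustrative value."""
    return np.tanh(gain * audio) / np.tanh(gain)

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the speech-to-noise power ratio is `snr_db` dB."""
    noise = np.resize(noise, speech.shape)  # repeat or trim noise to match length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silent noise
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def augment(audio, noise, rng, rounds=3):
    """Apply compounding rounds: distortion with 25% probability, then noise at 5-15 dB SNR."""
    for _ in range(rounds):
        if rng.random() < 0.25:
            audio = tanh_distortion(audio)
        audio = mix_at_snr(audio, noise, snr_db=rng.uniform(5.0, 15.0))
    return audio

rng = np.random.default_rng(0)
speech = 0.1 * rng.standard_normal(16000).astype(np.float32)  # placeholder clip
noise = rng.standard_normal(5000).astype(np.float32)          # placeholder noise
augmented = augment(speech, noise, rng)
print(augmented.shape)
```

Because the rounds compound, noise added in an early round is itself distorted and re-mixed in later rounds, so the final SNR is lower than any single round's target.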
Training
- 3-phase adaptive training with focal loss (gamma=2.0)
- Embedding mixup regularization (alpha=0.2)
- Label smoothing (epsilon=0.05)
- Cosine warmup + decay learning rate schedule
- Negative class weight ramp from 1 to 3000
- Checkpoint averaging over best validation checkpoints
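The loss and mixup settings listed above can be sketched in NumPy as follows. The hyperparameters (gamma=2.0, epsilon=0.05, alpha=0.2) come from the list; how livekit-wakeword actually combines focal loss, label smoothing, and the negative class weight is not stated in this card, so the combination below and the function names are assumptions.

```python
import numpy as np

def smoothed_focal_loss(probs, labels, gamma=2.0, epsilon=0.05, neg_weight=1.0):
    """Binary focal loss on sigmoid outputs with label smoothing.

    probs: predicted wake word probabilities in (0, 1).
    labels: targets in [0, 1] (hard labels, or soft labels from mixup).
    neg_weight: weight on negative examples (the card ramps this from 1 to 3000).
    """
    p = np.clip(np.asarray(probs, dtype=np.float64), 1e-7, 1 - 1e-7)
    y = np.asarray(labels, dtype=np.float64)
    y_smooth = y * (1 - epsilon) + 0.5 * epsilon  # 1 -> 0.975, 0 -> 0.025
    # Focal factors down-weight examples the model already classifies well
    pos_term = -y_smooth * (1 - p) ** gamma * np.log(p)
    neg_term = -(1 - y_smooth) * p ** gamma * np.log(1 - p)
    weights = y + (1 - y) * neg_weight  # 1 for positives, neg_weight for negatives
    return float(np.mean(weights * (pos_term + neg_term)))

def mixup(emb_a, emb_b, label_a, label_b, rng, alpha=0.2):
    """Blend two embedding windows and their labels with a Beta(alpha, alpha) coefficient."""
    lam = rng.beta(alpha, alpha)
    return lam * emb_a + (1 - lam) * emb_b, lam * label_a + (1 - lam) * label_b

rng = np.random.default_rng(0)
pos_emb = rng.standard_normal((16, 96))  # placeholder positive embeddings
neg_emb = rng.standard_normal((16, 96))  # placeholder negative embeddings
mixed_emb, mixed_label = mixup(pos_emb, neg_emb, 1.0, 0.0, rng)
print(mixed_emb.shape, smoothed_focal_loss([0.9], [mixed_label]))
```

With alpha=0.2 the Beta distribution concentrates near 0 and 1, so most mixed samples stay close to one of the two originals.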
File Structure
├── README.md
├── configs/                              # Training YAML configs
│   ├── hey_buddy_en_base.yaml            # EN data generation config
│   ├── hey_buddy_de_base.yaml            # DE data generation config
│   └── hey_buddy_{en,de}_{tiny,small,medium,large}.yaml
├── en_tiny/
│   ├── hey_buddy_en_tiny.onnx            # ONNX model (119 KB)
│   ├── hey_buddy_en_tiny.pt              # PyTorch state dict
│   ├── hey_buddy_en_tiny_eval.json       # Evaluation metrics
│   ├── hey_buddy_en_tiny_det.png         # DET curve plot
│   └── hey_buddy_en_tiny_metrics.json
├── en_small/
├── en_medium/
├── en_large/
├── de_tiny/
├── de_small/
├── de_medium/
└── de_large/
Recommended Models
- For production/edge: en_medium or de_medium (best balance of recall vs. false positive rate)
- For quality-first: en_large or de_large (highest recall at threshold 0.5, but higher FPPH)
- For resource-constrained: en_small or de_small (zero FPPH at threshold 0.5, moderate recall)
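The same trade-off can be expressed as a lookup over the published metrics. The sketch below copies FPPH and Recall@0.5 from the model tables and picks the highest-recall model within a false-positive budget; the helper name `best_model` is hypothetical, and all numbers apply at threshold 0.5 only.

```python
# (FPPH, Recall@0.5 in percent) copied from the model tables
METRICS = {
    "en_tiny": (0.00, 39.6), "en_small": (0.00, 61.8),
    "en_medium": (0.54, 86.8), "en_large": (1.03, 92.3),
    "de_tiny": (0.00, 25.0), "de_small": (0.00, 47.6),
    "de_medium": (1.11, 82.6), "de_large": (2.55, 89.2),
}

def best_model(lang, max_fpph):
    """Return the model for `lang` ("en" or "de") with the highest Recall@0.5
    whose FPPH does not exceed `max_fpph`, or None if none qualifies."""
    candidates = [
        (recall, name)
        for name, (fpph, recall) in METRICS.items()
        if name.startswith(lang + "_") and fpph <= max_fpph
    ]
    return max(candidates)[1] if candidates else None

print(best_model("en", max_fpph=0.0))
print(best_model("en", max_fpph=1.0))
print(best_model("de", max_fpph=3.0))
```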
License
Apache 2.0
Citation
@misc{bude-wakeword-2026,
title={Bud-E Wake Word Models},
author={LAION},
year={2026},
url={https://huggingface.co/laion/bud-e_wakeword-models_livekit-wakeword}
}
Acknowledgments
- livekit-wakeword toolkit by LiveKit
- VoxCPM2 TTS by OpenBMB
- ChatterboxTTS by Resemble AI
- Piper TTS
- ACAV100M speech features