Bud-E Wake Word Models ("Hey Buddy")

Wake word detection models for the phrase "Hey Buddy", trained using the livekit-wakeword toolkit. These models are designed for the Bud-E voice assistant project.

Models

8 models are provided: 4 sizes (tiny, small, medium, large) x 2 languages (English, German).

English Models

Model	Size	AUT	FPPH	Recall@0.5	Optimal Recall	Optimal Threshold	ONNX Size
`en_tiny`	16d, 1 block	0.0087	0.00	39.6%	66.1% @ 0.32	0.32	119 KB
`en_small`	32d, 1 block	0.0067	0.00	61.8%	73.6% @ 0.36	0.36	163 KB
`en_medium`	128d, 2 blocks	0.0062	0.54	86.8%	79.6% @ 0.76	0.76	933 KB
`en_large`	256d, 3 blocks	0.0038	1.03	92.3%	83.7% @ 0.88	0.88	3.8 MB

German Models

Model	Size	AUT	FPPH	Recall@0.5	Optimal Recall	Optimal Threshold	ONNX Size
`de_tiny`	16d, 1 block	0.0111	0.00	25.0%	63.4% @ 0.24	0.24	119 KB
`de_small`	32d, 1 block	0.0097	0.00	47.6%	67.7% @ 0.30	0.30	163 KB
`de_medium`	128d, 2 blocks	0.0060	1.11	82.6%	74.5% @ 0.79	0.79	933 KB
`de_large`	256d, 3 blocks	0.0066	2.55	89.2%	51.3% @ 0.95	0.95	3.8 MB

Metrics:

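AUT (Area Under DET curve): Lower is better. Measures overall detection quality across thresholds.
FPPH (False Positives Per Hour): Lower is better. Reported at threshold=0.5.
Recall@0.5: Higher is better. Fraction of true wake words detected at threshold=0.5.
Optimal Recall: Recall measured at the optimal threshold. It can be lower than Recall@0.5 when the threshold has to be raised above 0.5 to satisfy the FPPH limit, as for the medium and large models.
Optimal Threshold: The threshold that maximizes recall while keeping FPPH < 0.1/hr.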
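
To make the metric definitions concrete, here is a minimal sketch of how FPPH, recall, and the optimal threshold could be computed from raw classifier scores. The function names, the 0.01 threshold grid, and the strict `score > threshold` comparison are assumptions made for illustration; the actual evaluation code in livekit-wakeword may differ.

```python
import numpy as np


def fpph(neg_scores, hours, threshold):
    """False positives per hour: negative clips scoring above the threshold."""
    return float(np.sum(np.asarray(neg_scores) > threshold)) / hours


def recall(pos_scores, threshold):
    """Fraction of true wake-word clips scoring above the threshold."""
    return float(np.mean(np.asarray(pos_scores) > threshold))


def optimal_threshold(pos_scores, neg_scores, hours, max_fpph=0.1):
    """Return (threshold, recall) for the lowest grid threshold with FPPH < max_fpph.

    Recall never increases as the threshold rises, so the lowest admissible
    threshold is also the one that maximizes recall.
    """
    for t in (i / 100 for i in range(1, 100)):
        if fpph(neg_scores, hours, t) < max_fpph:
            return t, recall(pos_scores, t)
    return None  # no threshold on the grid satisfies the FPPH limit


# Toy data: 4 positive clips, 3 negative detections over 10 hours of audio.
pos = [0.9, 0.8, 0.4, 0.2]
neg = [0.1, 0.305, 0.05]
print(recall(pos, 0.5), fpph(neg, 10.0, 0.5))
print(optimal_threshold(pos, neg, 10.0))
```

With real data, `neg_scores` would hold the per-window scores on negative audio and `hours` the total duration of that audio.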

Architecture

All models use the conv_attention classifier architecture from livekit-wakeword:

Input: Pre-extracted speech embeddings of shape (batch, 16, 96) from the frozen Google speech_embedding model
Architecture: Conv1D layers + Multi-head Attention + Mean Pooling + Linear head + Sigmoid
Output: Confidence score in [0, 1]
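
The data flow through such a classifier can be sketched in plain NumPy. This is a shape walkthrough with random weights, not the livekit-wakeword implementation: the kernel size, ReLU activation, residual connection, head count, and all function names are assumptions, and it processes a single (16, 96) example rather than a batch.

```python
import numpy as np


def conv1d_same(x, w, b):
    """1D convolution over time with zero 'same' padding.

    x: (T, C_in), w: (K, C_in, C_out) with odd K, b: (C_out,).
    """
    k = w.shape[0]
    xp = np.pad(x, ((k // 2, k // 2), (0, 0)))
    out = np.stack([np.einsum("kc,kco->o", xp[t:t + k], w) for t in range(x.shape[0])])
    return out + b


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(x, wq, wk, wv, wo, n_heads):
    """Multi-head scaled dot-product self-attention over x: (T, D)."""
    t, d = x.shape
    hd = d // n_heads

    def split(m):  # (T, D) -> (heads, T, hd)
        return m.reshape(t, n_heads, hd).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(hd))  # (heads, T, T)
    ctx = (weights @ v).transpose(1, 0, 2).reshape(t, d)
    return ctx @ wo


def init_params(dim, n_blocks, in_dim=96, kernel=3, seed=0):
    """Random weights; 'dim' and 'n_blocks' mirror the Size column (e.g. 16d, 1 block)."""
    rng = np.random.default_rng(seed)
    blocks, c_in = [], in_dim
    for _ in range(n_blocks):
        block = {"conv_w": rng.normal(0.0, 0.1, (kernel, c_in, dim)), "conv_b": np.zeros(dim)}
        for name in ("wq", "wk", "wv", "wo"):
            block[name] = rng.normal(0.0, 0.1, (dim, dim))
        blocks.append(block)
        c_in = dim
    return {"blocks": blocks, "head_w": rng.normal(0.0, 0.1, dim), "head_b": 0.0}


def forward(emb, params, n_heads=4):
    """emb: (16, 96) speech embeddings -> confidence score in (0, 1)."""
    h = emb
    for blk in params["blocks"]:
        h = np.maximum(conv1d_same(h, blk["conv_w"], blk["conv_b"]), 0.0)  # Conv1D + ReLU
        h = h + self_attention(h, blk["wq"], blk["wk"], blk["wv"], blk["wo"], n_heads)
    pooled = h.mean(axis=0)  # mean pooling over the 16 time steps
    logit = float(pooled @ params["head_w"] + params["head_b"])  # linear head
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid


emb = np.random.default_rng(1).normal(size=(16, 96))
score = forward(emb, init_params(dim=16, n_blocks=1))
print(f"score = {score:.4f}")
```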

The full inference pipeline is:

Audio (16 kHz mono) → Mel spectrogram (ONNX frontend) → Speech embeddings (N, 96) (ONNX encoder) → Pad/truncate to (16, 96) → Classifier (this model) → Score [0, 1]

The mel spectrogram and speech embedding ONNX models are bundled with the livekit-wakeword package in resources/.

Usage

With livekit-wakeword (Recommended)

pip install livekit-wakeword

from livekit.wakeword import WakeWordDetector
import numpy as np

# Load model
detector = WakeWordDetector.from_pretrained("laion/bud-e_wakeword-models_livekit-wakeword", model_name="en_large")

# Process audio (16 kHz, mono, float32)
audio = np.random.randn(32000).astype(np.float32)  # placeholder: 2 seconds of random noise, replace with real audio
score = detector.detect(audio)
print(f"Wake word confidence: {score:.3f}")

# Use optimal threshold from training
if score > 0.88:  # optimal_threshold for en_large
    print("Wake word detected!")
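
Each model has its own optimal threshold, so the hard-coded 0.88 only suits en_large. A small lookup, with values copied from the tables above, avoids mixing them up. The names `OPTIMAL_THRESHOLDS` and `is_wake_word` are hypothetical helpers and not part of livekit-wakeword.

```python
# Optimal thresholds per model, copied from the English and German tables.
OPTIMAL_THRESHOLDS = {
    "en_tiny": 0.32, "en_small": 0.36, "en_medium": 0.76, "en_large": 0.88,
    "de_tiny": 0.24, "de_small": 0.30, "de_medium": 0.79, "de_large": 0.95,
}


def is_wake_word(score, model_name):
    """True if the score exceeds the optimal threshold of the given model."""
    return score > OPTIMAL_THRESHOLDS[model_name]


print(is_wake_word(0.90, "en_large"))
print(is_wake_word(0.90, "de_large"))
```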

Direct ONNX Inference

import onnxruntime as ort
import numpy as np

# Load classifier
session = ort.InferenceSession("en_large/hey_buddy_en_large.onnx")

# Input: pre-extracted speech embeddings (batch, 16, 96)
embeddings = np.random.randn(1, 16, 96).astype(np.float32)

# Run inference
score = session.run(["score"], {"embeddings": embeddings})[0]
print(f"Score: {score[0, 0]:.4f}")

Full Pipeline (Manual)

For custom integration without the livekit-wakeword package:

import onnxruntime as ort
import numpy as np
import librosa

# Load pipeline models (from livekit-wakeword resources/)
mel_session = ort.InferenceSession("melspectrogram.onnx")
embed_session = ort.InferenceSession("speech_embedding.onnx")
classifier_session = ort.InferenceSession("en_large/hey_buddy_en_large.onnx")

# 1. Load audio
audio, sr = librosa.load("audio.wav", sr=16000, mono=True)

# 2. Compute mel spectrogram
mel = mel_session.run(None, {"audio": audio.reshape(1, -1)})[0]

# 3. Extract embeddings
embeddings = embed_session.run(None, {"mel": mel})[0]  # (N, 96)

# 4. Pad/truncate to (16, 96)
if embeddings.shape[0] >= 16:
    embeddings = embeddings[-16:]
else:
    pad = np.zeros((16 - embeddings.shape[0], 96), dtype=np.float32)
    embeddings = np.concatenate([pad, embeddings])

# 5. Run classifier
score = classifier_session.run(
    ["score"],
    {"embeddings": embeddings.reshape(1, 16, 96)}
)[0][0, 0]

print(f"Wake word score: {score:.4f}")

Training Data

Models were trained on synthetic speech generated with multiple TTS backends for diversity: three for English and two for German.

English

Piper VITS (en_US-lessac-medium): 4,000 positive train / 800 test — 904 speaker voices with SLERP blending
VoxCPM2: 3,000 positive train / 600 test — 29 voice design prompts x 4 CFG values x 3 timestep configs
ChatterboxTTS: 2,000 positive train / 400 test — 8 reference voices with varying exaggeration/temperature
Adversarial negatives: 4,000 train / 800 test — phonetically similar phrases ("hey body", "hey bunny", "hey baby", etc.)
Background noise: 1,000 train / 200 test — from MUSAN
General negatives: ACAV100M ~2000 hrs pre-extracted speech features

German

VoxCPM2: 3,000 positive train / 600 test
ChatterboxTTS: 2,000 positive train / 400 test
Adversarial negatives: 3,000 train / 600 test
Background noise: 1,000 train / 200 test
General negatives: ACAV100M ~2000 hrs

Augmentation

3 rounds of compounding augmentation per clip:
- 7-band parametric EQ (25% probability)
- Tanh distortion (25% probability)
- Room impulse response convolution (50% probability, MIT RIRs)
- Background noise mixing (SNR 5-15 dB)
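
The augmentation chain above could look roughly like the following NumPy sketch. It is an illustration under stated assumptions, not the training code: the 7-band parametric EQ is left out for brevity, noise mixing is assumed to run in every round, the distortion gain range is invented, and all function names are hypothetical.

```python
import numpy as np


def tanh_distortion(x, rng, max_gain=5.0):
    """Soft-clip the signal with tanh, then restore the original peak level."""
    y = np.tanh(rng.uniform(1.0, max_gain) * x)
    peak = np.max(np.abs(y))
    return y * (np.max(np.abs(x)) / peak) if peak > 0 else y


def apply_rir(x, rir):
    """Convolve with a room impulse response and keep the original length."""
    return np.convolve(x, rir)[: len(x)]


def mix_noise(x, noise, snr_db):
    """Add noise scaled so that the signal-to-noise ratio equals snr_db."""
    noise = np.resize(noise, len(x))  # repeat or trim the noise to match x
    sig_p, noise_p = np.mean(x ** 2), np.mean(noise ** 2)
    if sig_p == 0 or noise_p == 0:
        return x
    scale = np.sqrt(sig_p / (noise_p * 10 ** (snr_db / 10)))
    return x + scale * noise


def augment(x, rir, noise, rng, rounds=3):
    """Apply the chain 'rounds' times so the effects compound."""
    for _ in range(rounds):
        # The 7-band parametric EQ (25% probability) would go here; omitted in this sketch.
        if rng.random() < 0.25:
            x = tanh_distortion(x, rng)
        if rng.random() < 0.50:
            x = apply_rir(x, rir)
        x = mix_noise(x, noise, rng.uniform(5.0, 15.0))
    return x.astype(np.float32)


rng = np.random.default_rng(0)
clip = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s test tone at 16 kHz
rir = np.exp(-np.arange(800) / 100.0) * rng.normal(size=800)  # synthetic decaying RIR
noise = rng.normal(size=8000)
augmented = augment(clip, rir, noise, rng)
print(augmented.shape, augmented.dtype)
```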

Training

3-phase adaptive training with focal loss (gamma=2.0)
Embedding mixup regularization (alpha=0.2)
Label smoothing (epsilon=0.05)
Cosine warmup + decay learning rate schedule
Negative class weight ramp from 1 to 3000
Checkpoint averaging over best validation checkpoints
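
Several of these ingredients are compact enough to sketch in NumPy. The version below is one plausible reading, not the livekit-wakeword training code: the soft-label form of the focal loss, the symmetric label-smoothing variant, the linear shape of the warmup, and the way the negative class weight enters the loss are all assumptions, and the function names are hypothetical.

```python
import math

import numpy as np


def smooth_labels(y, eps=0.05):
    """Symmetric binary label smoothing: 1 -> 1 - eps/2, 0 -> eps/2."""
    return y * (1.0 - eps) + 0.5 * eps


def focal_bce(p, y, gamma=2.0, neg_weight=1.0, tiny=1e-7):
    """Binary focal loss on probabilities p with (possibly soft) labels y.

    neg_weight scales the negative-class term; during training it would be
    ramped from 1 up to 3000.
    """
    p = np.clip(p, tiny, 1.0 - tiny)
    pos_term = -y * (1.0 - p) ** gamma * np.log(p)
    neg_term = -(1.0 - y) * p ** gamma * np.log(1.0 - p)
    return float(np.mean(pos_term + neg_weight * neg_term))


def mixup(x, y, rng, alpha=0.2):
    """Blend each embedding and label with a randomly chosen partner from the batch."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1.0 - lam) * x[perm], lam * y + (1.0 - lam) * y[perm]


def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16, 96))           # batch of speech embeddings
y = np.array([1, 0, 0, 1, 0, 0, 0, 1.0])   # 1 = wake word, 0 = negative
x_mix, y_mix = mixup(x, smooth_labels(y), rng)
p = rng.uniform(0.05, 0.95, size=8)        # stand-in for classifier outputs
print(focal_bce(p, y_mix, neg_weight=10.0))
print(lr_at(0, 1000, 100, 1e-3), lr_at(100, 1000, 100, 1e-3))
```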

File Structure

├── README.md
├── configs/                    # Training YAML configs
│   ├── hey_buddy_en_base.yaml  # EN data generation config
│   ├── hey_buddy_de_base.yaml  # DE data generation config
│   └── hey_buddy_{en,de}_{tiny,small,medium,large}.yaml
├── en_tiny/
│   ├── hey_buddy_en_tiny.onnx       # ONNX model (119 KB)
│   ├── hey_buddy_en_tiny.pt         # PyTorch state dict
│   ├── hey_buddy_en_tiny_eval.json  # Evaluation metrics
│   ├── hey_buddy_en_tiny_det.png    # DET curve plot
│   └── hey_buddy_en_tiny_metrics.json
├── en_small/
├── en_medium/
├── en_large/
├── de_tiny/
├── de_small/
├── de_medium/
└── de_large/

Recommended Models

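For production/edge: en_medium or de_medium (best balance of recall and false positive rate)
For quality-first: en_large or de_large (highest recall at threshold 0.5, but also the highest FPPH; note that de_large falls to 51.3% recall at its optimal threshold of 0.95)
For resource-constrained: en_small or de_small (zero FPPH at threshold 0.5, moderate recall)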

License

Apache 2.0

Citation

@misc{bude-wakeword-2026,
  title={Bud-E Wake Word Models},
  author={LAION},
  year={2026},
  url={https://huggingface.co/laion/bud-e_wakeword-models_livekit-wakeword}
}

Acknowledgments

livekit-wakeword toolkit by LiveKit
VoxCPM2 TTS by OpenBMB
ChatterboxTTS by Resemble AI
Piper TTS
ACAV100M speech features
