Frame-wise speaker embeddings
This repository contains a standalone Python interface for the frame-wise and geodesic teacher/student speaker embedding checkpoints.
Files:
ckpt_frame_overlap_robust_best_EER.pth: best-EER checkpoint from the frame-wise and overlap-robust student run.
ckpt_geodesic_interpolation_best_EER.pth: best-EER checkpoint from the geodesic interpolation student run.
frame_wise_speaker_embeddings.py: self-contained loader and inference code for the teacher and student. It auto-detects which checkpoint variant is loaded.
The teacher produces utterance-level 256-dimensional embeddings. The student
produces frame-wise embeddings in the same 256-dimensional embedding space as
the teacher. For ckpt_geodesic_interpolation_best_EER.pth, the student is
64-dimensional internally and is projected with the checkpoint's
reduction_layer. For ckpt_frame_overlap_robust_best_EER.pth, the student is
directly 256-dimensional and no projection layer is used.
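Because the student's frame-wise embeddings share the teacher's 256-dimensional space, one possible use is to compare each frame against an utterance-level reference embedding. The following is a minimal sketch of such a comparison using cosine similarity. The helper name `frame_cosine_similarity` is hypothetical and not part of this repository, and random arrays stand in for real model outputs.

```python
import numpy as np


def frame_cosine_similarity(frames, reference, eps=1e-8):
    """Cosine similarity between each frame embedding and one reference.

    frames:    array of shape (T, D), e.g. frame-wise student embeddings
    reference: array of shape (D,), e.g. an utterance-level teacher embedding
    Returns an array of shape (T,).
    """
    frames = np.asarray(frames, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Normalize to unit length; eps guards against all-zero vectors.
    frame_norms = np.linalg.norm(frames, axis=-1, keepdims=True)
    frames = frames / np.maximum(frame_norms, eps)
    reference = reference / max(np.linalg.norm(reference), eps)
    return frames @ reference


# Placeholder data standing in for real model outputs (assumption).
rng = np.random.default_rng(0)
student_frames = rng.standard_normal((50, 256))
teacher = rng.standard_normal(256)

similarity = frame_cosine_similarity(student_frames, teacher)
print(similarity.shape)  # (50,)
```

With real data, `student_frames` and `teacher` would come from the model calls shown under Usage.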
Usage
git clone https://huggingface.co/boeddeker/frame_wise_speaker_embeddings
cd frame_wise_speaker_embeddings
wget -O ckpt_frame_overlap_robust_best_EER.pth \
"https://huggingface.co/boeddeker/frame_wise_speaker_embeddings/resolve/main/ckpt_frame_overlap_robust_best_EER.pth?download=true"
wget -O ckpt_geodesic_interpolation_best_EER.pth \
"https://huggingface.co/boeddeker/frame_wise_speaker_embeddings/resolve/main/ckpt_geodesic_interpolation_best_EER.pth?download=true"
# git lfs pull # may be an alternative to wget, but command is untested
import numpy as np
from frame_wise_speaker_embeddings import load_frame_wise_speaker_embeddings
model = load_frame_wise_speaker_embeddings("ckpt_geodesic_interpolation_best_EER.pth")
# or:
# model = load_frame_wise_speaker_embeddings("ckpt_frame_overlap_robust_best_EER.pth")
audio = np.zeros(32000, dtype=np.float32) # 16 kHz waveform
teacher = model.teacher_embedding(audio) # shape: (256,)
student = model.student_embedding(audio) # shape: (256,)
student_frames = model.student_frame_embeddings(audio) # shape: (T, 256)
out = model.extract(audio)
teacher = out["teacher"]
student = out["student"]
student_frames = out["student_frames"]
Batch input is supported as a NumPy array with shape (batch, samples).
The expected sample rate is 16 kHz.
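Recordings of different lengths have to be brought to a common length before they fit into a single (batch, samples) array. Below is a minimal sketch, assuming that zero-padding at the end is acceptable for your use case; this README does not state how padded silence affects the embeddings. `pad_to_batch` is a hypothetical helper, not part of this repository.

```python
import numpy as np


def pad_to_batch(waveforms):
    """Zero-pad 1-D waveforms to the longest one and stack them.

    Returns (batch, lengths): batch has shape (batch, samples) and dtype
    float32; lengths holds the original number of samples per waveform.
    """
    waveforms = [np.asarray(w, dtype=np.float32) for w in waveforms]
    if any(w.ndim != 1 for w in waveforms):
        raise ValueError("each waveform must be 1-D (mono)")
    lengths = np.array([len(w) for w in waveforms])
    batch = np.zeros((len(waveforms), lengths.max()), dtype=np.float32)
    for i, w in enumerate(waveforms):
        # Copy the signal to the start of the row; the rest stays zero.
        batch[i, : len(w)] = w
    return batch, lengths


sample_rate = 16000  # sample rate expected by the model
batch, lengths = pad_to_batch([
    np.ones(2 * sample_rate),  # 2 s placeholder signal
    np.ones(3 * sample_rate),  # 3 s placeholder signal
])
print(batch.shape)  # (2, 48000)
```

The resulting `batch` has the documented (batch, samples) layout. Keeping `lengths` around makes it possible to discard frames that fall into the padded region afterwards.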
Dependencies
pip install torch
pip install numpy einops paderbox padertorch
Cite
If you use a checkpoint, please cite the corresponding paper.
Frame-wise and overlap-robust speaker embeddings for meeting diarization:
@inproceedings{cord2023frame,
title={Frame-wise and overlap-robust speaker embeddings for meeting diarization},
author={Cord-Landwehr, Tobias and Boeddeker, Christoph and Zoril{\u{a}}, C{\u{a}}t{\u{a}}lin and Doddipatla, Rama and Haeb-Umbach, Reinhold},
booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--5},
year={2023},
organization={IEEE}
}
Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios:
@inproceedings{cord2024geodesic,
title={Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios},
author={Cord-Landwehr, Tobias and Boeddeker, Christoph and Zoril{\u{a}}, C{\u{a}}t{\u{a}}lin and Doddipatla, Rama and Haeb-Umbach, Reinhold},
booktitle={2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={11886--11890},
year={2024},
organization={IEEE}
}