Frame-wise speaker embeddings

This repository contains a standalone Python interface for the frame-wise and geodesic teacher/student speaker embedding checkpoints.

Files:

ckpt_frame_overlap_robust_best_EER.pth: best-EER checkpoint from the frame-wise and overlap-robust student run.
ckpt_geodesic_interpolation_best_EER.pth: best-EER checkpoint from the geodesic interpolation student run.
frame_wise_speaker_embeddings.py: self-contained loader and inference code for the teacher and student. It auto-detects which checkpoint variant is loaded.

The teacher produces utterance-level 256-dimensional embeddings. The student produces frame-wise embeddings in the same 256-dimensional embedding space as the teacher. For ckpt_geodesic_interpolation_best_EER.pth, the student is 64-dimensional internally and is projected with the checkpoint's reduction_layer. For ckpt_frame_overlap_robust_best_EER.pth, the student is directly 256-dimensional and no projection layer is used.

Usage

git clone https://huggingface.co/boeddeker/frame_wise_speaker_embeddings
cd frame_wise_speaker_embeddings
wget -O ckpt_frame_overlap_robust_best_EER.pth \
  "https://huggingface.co/boeddeker/frame_wise_speaker_embeddings/resolve/main/ckpt_frame_overlap_robust_best_EER.pth?download=true"
wget -O ckpt_geodesic_interpolation_best_EER.pth \
  "https://huggingface.co/boeddeker/frame_wise_speaker_embeddings/resolve/main/ckpt_geodesic_interpolation_best_EER.pth?download=true"
# git lfs pull   # may be an alternative to wget, but command is untested

import numpy as np
from frame_wise_speaker_embeddings import load_frame_wise_speaker_embeddings

model = load_frame_wise_speaker_embeddings("ckpt_geodesic_interpolation_best_EER.pth")
# or:
# model = load_frame_wise_speaker_embeddings("ckpt_frame_overlap_robust_best_EER.pth")

audio = np.zeros(32000, dtype=np.float32)  # placeholder waveform: 2 s of silence at 16 kHz

teacher = model.teacher_embedding(audio)          # shape: (256,)
student = model.student_embedding(audio)          # shape: (256,)
student_frames = model.student_frame_embeddings(audio)  # shape: (T, 256)

# Alternatively, get all three outputs from a single call:
out = model.extract(audio)
teacher = out["teacher"]
student = out["student"]
student_frames = out["student_frames"]
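
Because the student frames live in the same 256-dimensional space as the teacher embedding, one possible use is to score every frame against an utterance-level embedding. The following is a minimal sketch on synthetic arrays. frame_cosine_similarity is a hypothetical helper, not part of frame_wise_speaker_embeddings.py, and the choice of cosine similarity as the comparison measure is an assumption.

```python
import numpy as np


def frame_cosine_similarity(frames: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Cosine similarity between each frame embedding (T, D) and one reference (D,)."""
    frames = np.asarray(frames, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    eps = 1e-12  # guards against division by zero for all-zero vectors
    frame_norms = np.linalg.norm(frames, axis=-1)
    ref_norm = np.linalg.norm(reference)
    return (frames @ reference) / np.maximum(frame_norms * ref_norm, eps)


# Synthetic stand-ins for out["student_frames"] and out["teacher"].
rng = np.random.default_rng(0)
reference = rng.standard_normal(256)
frames = np.stack([reference * 2.0, -reference, rng.standard_normal(256)])
similarity = frame_cosine_similarity(frames, reference)  # shape: (3,)
```

With real model output, frames would be out["student_frames"] and reference would be out["teacher"] or an embedding of an enrollment utterance.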

Batch input is supported as a NumPy array with shape (batch, samples).
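
A batch of shape (batch, samples) requires all waveforms to have the same length. One way to build such an array from recordings of different lengths is zero-padding, sketched below. pad_to_batch is a hypothetical helper, not part of this repository. Whether the model masks the padded samples is not stated here, so padding may change the embeddings of the shorter utterances.

```python
import numpy as np


def pad_to_batch(waveforms):
    """Zero-pad 1-D waveforms to a common length; returns (batch, lengths)."""
    lengths = np.array([len(w) for w in waveforms], dtype=np.int64)
    batch = np.zeros((len(waveforms), int(lengths.max())), dtype=np.float32)
    for i, w in enumerate(waveforms):
        batch[i, : len(w)] = w  # copy the samples, leave the tail at zero
    return batch, lengths


waveforms = [np.ones(16000, dtype=np.float32), np.ones(32000, dtype=np.float32)]
batch, lengths = pad_to_batch(waveforms)  # batch.shape == (2, 32000)
# out = model.extract(batch)  # assumes `model` was loaded as shown above
```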

The expected sample rate is 16 kHz.
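
A mismatched sample rate is easy to miss because the model only receives a plain array. The sketch below reads a WAV file with the Python standard library and refuses anything that is not 16 kHz. read_wav_16k is a hypothetical helper, the restriction to mono 16-bit PCM is a simplification, and scaling the samples to [-1, 1) is an assumption about the expected amplitude range.

```python
import io
import wave

import numpy as np

EXPECTED_RATE = 16000


def read_wav_16k(source) -> np.ndarray:
    """Read a mono 16-bit PCM WAV file and return float32 samples in [-1, 1)."""
    with wave.open(source, "rb") as f:
        if f.getframerate() != EXPECTED_RATE:
            raise ValueError(f"expected {EXPECTED_RATE} Hz, got {f.getframerate()} Hz")
        if f.getnchannels() != 1 or f.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM")
        pcm = np.frombuffer(f.readframes(f.getnframes()), dtype="<i2")
    return pcm.astype(np.float32) / 32768.0


# Round trip through an in-memory WAV file for demonstration.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(EXPECTED_RATE)
    f.writeframes(np.array([0, 16384, -32768], dtype="<i2").tobytes())
buffer.seek(0)
audio = read_wav_16k(buffer)
```

The same function accepts a file path instead of the in-memory buffer. Recordings at other sample rates would need to be resampled first, which this sketch does not do.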

Dependencies

pip install torch
pip install numpy einops paderbox padertorch

Cite

If you use a checkpoint, please cite the corresponding paper.

Frame-wise and overlap-robust speaker embeddings for meeting diarization:

@inproceedings{cord2023frame,
  title={Frame-wise and overlap-robust speaker embeddings for meeting diarization},
  author={Cord-Landwehr, Tobias and Boeddeker, Christoph and Zoril{\u{a}}, C{\u{a}}t{\u{a}}lin and Doddipatla, Rama and Haeb-Umbach, Reinhold},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios:

@inproceedings{cord2024geodesic,
  title={Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios},
  author={Cord-Landwehr, Tobias and Boeddeker, Christoph and Zoril{\u{a}}, C{\u{a}}t{\u{a}}lin and Doddipatla, Rama and Haeb-Umbach, Reinhold},
  booktitle={2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={11886--11890},
  year={2024},
  organization={IEEE}
}

arXiv preprints:

Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios: arXiv:2401.03963 (published Jan 8, 2024)
Frame-wise and overlap-robust speaker embeddings for meeting diarization: arXiv:2306.00625 (published Jun 1, 2023)