Voice Shield · kNN-VC HiFi-GAN

Speaker anonymization checkpoint used by the Voice Shield panel in MAESTRO's AI Workstation (Tools → Voice Shield → Anonymize tab).

What this is

The prematched HiFi-GAN generator from bshall/knn-vc, repackaged as a single .safetensors file. It is trained to vocode 1024-dim WavLM-Large layer-6 features back to 16 kHz audio. It pairs with microsoft/wavlm-large, which is loaded directly from the Hugging Face Hub, so no separate mirror is needed.

Architecture: HiFi-GAN with a lin_pre Linear(1024→512) prefix
Parameters: ~16.5 M
Input: WavLM-Large layer-6 hidden states [B, T, 1024]
Output: 16 kHz waveform [B, samples] (320× upsample)
License: BSD-3-Clause (original kNN-VC license preserved)
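
The 320× figure is consistent with WavLM's 20 ms frame hop at 16 kHz (0.020 s × 16,000 samples/s = 320 samples per frame). A minimal sketch of the shape arithmetic follows; the helper name is illustrative and not part of MAESTRO or kNN-VC.

```python
SAMPLE_RATE = 16_000  # Hz, output rate stated in the table above
UPSAMPLE = 320        # waveform samples generated per feature frame


def output_samples(num_frames: int) -> int:
    """Number of waveform samples emitted for num_frames feature frames."""
    return num_frames * UPSAMPLE


frames = 150
samples = output_samples(frames)
print(samples, samples / SAMPLE_RATE)  # 48000 samples, 3.0 seconds
```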

What it's used for in MAESTRO

The Voice Shield panel offers voice-cloning protection with three threat models. The Anonymize tab uses this checkpoint to transform a user's voice into a different synthetic speaker via the kNN-VC pipeline:

1. Extract WavLM-Large layer-6 features from the user's voice.
2. For each frame, find the k=4 nearest matches in a "target speaker" feature pool and average them.
3. Vocode the matched features back to audio with this HiFi-GAN.
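
The matching step can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not MAESTRO's implementation: the function name is hypothetical, random arrays stand in for real WavLM features, and cosine similarity is assumed as the nearness measure (following the kNN-VC paper).

```python
import numpy as np


def knn_match(query: np.ndarray, pool: np.ndarray, k: int = 4) -> np.ndarray:
    """Replace each query frame with the mean of its k nearest pool frames.

    query: [T, D] source-speaker features; pool: [N, D] target-speaker features.
    Nearness is cosine similarity (an assumption, see the note above).
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    p = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    sim = q @ p.T  # [T, N] cosine similarities
    # Indices of the k most similar pool frames per query frame (unordered).
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    return pool[idx].mean(axis=1)  # [T, D] averaged matches


rng = np.random.default_rng(0)
source = rng.standard_normal((50, 1024)).astype(np.float32)    # stand-in for user features
target = rng.standard_normal((2000, 1024)).astype(np.float32)  # stand-in for the target pool
matched = knn_match(source, target)
print(matched.shape)  # (50, 1024)
```

Because every output frame is an average of target-speaker frames, the features handed to the vocoder contain no frames taken from the user's own recording.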

The output sounds like the target speaker — so any voice-cloning model trained on the output learns the target's identity, not the user's. This is the only voice-protection paradigm currently robust against adversarial-perturbation strippers like LightShed.

Loading

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

path = hf_hub_download(repo_id="AEmotionStudio/voiceshield-targets",
                       filename="hifigan_knnvc.safetensors")
state = load_file(path)
# state-dict keys match the bshall prematch_g_02500000.pt schema.

MAESTRO loads this through the vendor module at backend/ai/voiceshield/knn_vc_vendor/hifigan.py.
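
After loading, a quick way to sanity-check the file is to count the parameters and inspect tensor shapes. The sketch below is a hypothetical helper, not part of the vendor module; it runs on a small stand-in dict so it needs no download, and the key names in that dict are assumptions made for illustration.

```python
import math

import numpy as np


def summarize_state(state: dict) -> tuple[int, dict]:
    """Return (total element count, {key: shape}) for a dict of tensors or arrays."""
    shapes = {name: tuple(t.shape) for name, t in state.items()}
    total = sum(math.prod(shape) for shape in shapes.values())
    return total, shapes


# Stand-in for the real state dict; key names are hypothetical.
# A Linear(1024→512) layer stores its weight as (512, 1024) plus a (512,) bias.
demo = {"lin_pre.weight": np.zeros((512, 1024)), "lin_pre.bias": np.zeros(512)}
total, shapes = summarize_state(demo)
print(total)  # 524800
```

Calling summarize_state(state) on the dict returned by load_file should work the same way, since it only reads each tensor's shape. The total counts every stored tensor, so it may differ from the ~16.5 M figure if the checkpoint stores auxiliary tensors.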

Credits

Original kNN-VC: Matthew Baas, Benjamin van Niekerk, Herman Kamper — "Voice Conversion With Just Nearest Neighbors", Interspeech 2023. Code: https://github.com/bshall/knn-vc.
WavLM-Large: Microsoft (https://huggingface.co/microsoft/wavlm-large).

Honest framing

Anonymization is a one-way transform. The original speaker identity is unrecoverable from the output — that's the point, and the trade-off the user opts into.

