Voice Shield · kNN-VC HiFi-GAN
Speaker anonymization checkpoint used by the Voice Shield panel in MAESTRO's AI Workstation (Tools → Voice Shield → Anonymize tab).
What this is
The prematched HiFi-GAN generator from
bshall/knn-vc, repackaged as a
single .safetensors file. It is trained to vocode 1024-dim WavLM-Large layer-6
features back to 16 kHz audio. It pairs with microsoft/wavlm-large, which is
loaded directly from its own Hugging Face Hub repository, so no separate mirror is needed.
- Architecture: HiFi-GAN with a lin_pre Linear(1024→512) prefix
- Parameters: ~16.5 M
- Input: WavLM-Large layer-6 hidden states [B, T, 1024]
- Output: 16 kHz waveform [B, samples] (320× upsample)
- License: BSD-3-Clause (original kNN-VC license preserved)
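As a quick check on the shapes above, the sketch below works out how many output samples and seconds a given number of feature frames should produce. It assumes the generator emits exactly 320 samples per input frame with no trimming or padding at the edges; the constant and function names are illustrative and not part of the checkpoint or of kNN-VC.

```python
# Illustrative constants taken from the spec list above.
SAMPLE_RATE = 16_000  # Hz, output waveform rate
HOP = 320             # output samples per feature frame (the 320x upsample)


def output_samples(num_frames: int) -> int:
    """Expected waveform length for `num_frames` feature frames (assumes no edge trimming)."""
    return num_frames * HOP


def output_seconds(num_frames: int) -> float:
    """Expected audio duration in seconds for `num_frames` feature frames."""
    return output_samples(num_frames) / SAMPLE_RATE
```

At 320 samples per frame and 16 kHz, each frame covers 20 ms, so 50 frames correspond to one second of audio.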
What it's used for in MAESTRO
The Voice Shield panel offers voice-cloning protection with three threat models. The Anonymize tab uses this checkpoint to transform a user's voice into a different synthetic speaker via the kNN-VC pipeline:
- Extract WavLM-Large layer-6 features from the user's voice.
- For each frame, find the k=4 nearest matches in a "target speaker" feature pool and average them.
- Vocode the matched features back to audio with this HiFi-GAN.
The output sounds like the target speaker — so any voice-cloning model trained on the output learns the target's identity, not the user's. This is the only voice-protection paradigm currently robust against adversarial-perturbation strippers like LightShed.
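The matching step (the second step above) can be sketched in NumPy as follows. This is a minimal illustration, not MAESTRO's or kNN-VC's actual code: it assumes cosine similarity as the nearness measure and a plain unweighted mean, and the function name `knn_match` is hypothetical.

```python
import numpy as np


def knn_match(query: np.ndarray, pool: np.ndarray, k: int = 4) -> np.ndarray:
    """Replace each query frame with the mean of its k nearest pool frames.

    query: [T, D] features from the user's voice.
    pool:  [N, D] features from the target speaker.
    Nearness is cosine similarity (an assumption of this sketch).
    """
    if not 1 <= k <= len(pool):
        raise ValueError("k must be between 1 and the pool size")
    eps = 1e-8  # guards against division by zero for all-zero frames
    # L2-normalise rows so that a dot product equals cosine similarity.
    q = query / np.maximum(np.linalg.norm(query, axis=1, keepdims=True), eps)
    p = pool / np.maximum(np.linalg.norm(pool, axis=1, keepdims=True), eps)
    sim = q @ p.T  # [T, N]
    # Indices of the k most similar pool frames per query frame (order within k is arbitrary).
    top = np.argpartition(-sim, k - 1, axis=1)[:, :k]  # [T, k]
    # Average the original, un-normalised pool frames.
    return pool[top].mean(axis=1)  # [T, D]
```

Here `query` would hold the user's [T, 1024] features and `pool` the target speaker's [N, 1024] features. The result keeps the query's frame count, so it can be passed to the vocoder once a batch dimension is added.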
Loading
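from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Download the checkpoint from the Hub (or reuse the locally cached copy).
path = hf_hub_download(repo_id="AEmotionStudio/voiceshield-targets",
                       filename="hifigan_knnvc.safetensors")
# Load the weights as a dict mapping tensor names to torch.Tensor values.
state = load_file(path)
# state-dict keys match the bshall prematch_g_02500000.pt schema.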
MAESTRO loads this through the vendor module at
backend/ai/voiceshield/knn_vc_vendor/hifigan.py.
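A simple sanity check after loading is to count the tensors and total elements in the state dict. The helper below is a hypothetical sketch that relies only on each value exposing a `.shape`, which holds for the `torch.Tensor` values returned by `load_file`.

```python
import math


def summarize_state(state):
    """Return (tensor_count, total_elements) for a mapping of name -> tensor-like.

    Each value only needs a `.shape` attribute, so this works for torch tensors
    and NumPy arrays alike.
    """
    total = sum(math.prod(t.shape) for t in state.values())
    return len(state), total
```

Calling `summarize_state(state)` on the loaded checkpoint should give a total in the neighborhood of the ~16.5 M parameters listed above, assuming the file holds only the generator's weights.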
Credits
- Original kNN-VC: Matthew Baas, Benjamin van Niekerk, Herman Kamper — "Voice Conversion With Just Nearest Neighbors", Interspeech 2023. Code: https://github.com/bshall/knn-vc.
- WavLM-Large: Microsoft (https://huggingface.co/microsoft/wavlm-large).
Honest framing
Anonymization is a one-way transform. The original speaker identity is unrecoverable from the output — that's the point, and the trade-off the user opts into.