mozilla-foundation/common_voice_17_0
Whisper large-v3 fine-tuned for Cantonese (yue) automatic speech recognition.
| Metric | Value |
|---|---|
| CER (no punctuation) | 7.86% |
| CER (raw) | 9.62% |
| Eval Loss | 0.2437 |
| Best Step | 4500 |
| Best Epoch | 25.01 |

| Step | Epoch | Eval Loss | CER (no punctuation) | CER (raw) |
|---|---|---|---|---|
| 500 | 2.03 | 0.9878 | 13.29% | 18.11% |
| 1000 | 5.02 | 0.5829 | 9.24% | 13.33% |
| 1500 | 8.01 | 0.4137 | 8.86% | 11.46% |
| 2000 | 11.01 | 0.3403 | 8.52% | 10.71% |
| 2500 | 14.00 | 0.2991 | 8.28% | 10.39% |
| 3000 | 16.03 | 0.2746 | 8.12% | 9.96% |
| 3500 | 19.02 | 0.2590 | 7.99% | 9.80% |
| 4000 | 22.02 | 0.2489 | 7.93% | 9.71% |
| 4500 | 25.01 | 0.2437 | 7.86% | 9.62% |
| 5000 | 28.00 | 0.2411 | 7.87% | 9.64% |
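
The summary reports step 4500 as the best checkpoint even though step 5000 has a slightly lower eval loss. This is consistent with selecting on CER rather than on loss, although the model card does not state the selection metric, so treat that as an assumption. A minimal sketch that reproduces the choice from the rows above (the `history` list and variable names are illustrative, not part of the training code):

```python
# Training history copied from the table above.
# Columns: (step, epoch, eval_loss, cer_nopunct_pct, cer_raw_pct)
history = [
    (500, 2.03, 0.9878, 13.29, 18.11),
    (1000, 5.02, 0.5829, 9.24, 13.33),
    (1500, 8.01, 0.4137, 8.86, 11.46),
    (2000, 11.01, 0.3403, 8.52, 10.71),
    (2500, 14.00, 0.2991, 8.28, 10.39),
    (3000, 16.03, 0.2746, 8.12, 9.96),
    (3500, 19.02, 0.2590, 7.99, 9.80),
    (4000, 22.02, 0.2489, 7.93, 9.71),
    (4500, 25.01, 0.2437, 7.86, 9.62),
    (5000, 28.00, 0.2411, 7.87, 9.64),
]

# Assumption: the best checkpoint minimizes CER without punctuation.
best_by_cer = min(history, key=lambda row: row[3])

# For comparison: the checkpoint that minimizes eval loss instead.
best_by_loss = min(history, key=lambda row: row[2])

print("best by CER (no punctuation): step", best_by_cer[0])
print("best by eval loss: step", best_by_loss[0])
```

Selecting on raw CER gives the same checkpoint as selecting on CER without punctuation, since both reach their minimum at step 4500.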

| Split | CER (raw) | CER (no punctuation) |
|---|---|---|
| test_yue | 9.66% | 8.22% |
| holdback_yue | 10.31% | 8.47% |
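
The tables report character error rate (CER) both on the raw text and with punctuation removed. The model card does not describe the exact normalization used, so the following is only a sketch under stated assumptions: CER is the character-level Levenshtein distance divided by the reference length, whitespace is ignored, and "no punctuation" means dropping every character whose Unicode category starts with `P`. The helper names are hypothetical.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop characters in any Unicode punctuation category (assumed normalization)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, ignore_punctuation: bool = False) -> float:
    """Character error rate of hyp against ref, as a fraction of reference length."""
    if ignore_punctuation:
        ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
    # Assumption: whitespace does not count as a character.
    ref, hyp = "".join(ref.split()), "".join(hyp.split())
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


# Illustrative strings, not taken from the evaluation data.
reference = "你好,世界!"
hypothesis = "你好世界"

print(f"CER (raw): {cer(reference, hypothesis):.2%}")
print(f"CER (no punctuation): {cer(reference, hypothesis, ignore_punctuation=True):.2%}")
```

In this example the hypothesis differs from the reference only by two missing punctuation marks, so the raw CER is 2/6 while the punctuation-free CER is zero. The same effect explains why the raw figures in the tables are consistently higher.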
TensorBoard logs are included in the `runs/` directory of this repository.
# Clone and view locally
git clone https://huggingface.co/awong-dev/whisper-large-v3-yue-test1-baseline
tensorboard --logdir whisper-large-v3-yue-test1-baseline/runs
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torchaudio

model_id = "awong-dev/whisper-large-v3-yue-test1-baseline"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Load audio, downmix to mono, and resample to the 16 kHz rate Whisper expects
audio, sr = torchaudio.load("audio.mp3")
audio = audio.mean(dim=0)
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

# Convert the waveform to log-mel input features
input_features = processor(
    audio.numpy(), sampling_rate=16000, return_tensors="pt"
).input_features

# Generate token IDs and decode them to text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)