mozilla-foundation/common_voice_17_0

Fine-tuned Whisper model for Cantonese (yue) speech recognition. Metrics at the best checkpoint:

| Metric | Value |
|---|---|
| CER (no punctuation) | 7.79% |
| CER (raw) | 9.49% |
| Eval Loss | 0.2241 |
| Best Step | 5000 |
| Best Epoch | 28.00 |
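
The two CER figures differ only in whether punctuation is removed before scoring. As a rough illustration (this is not the evaluation script used for this model, which is not shown here), character error rate can be sketched as the character-level edit distance divided by the reference length. The helper names `strip_punct`, `edit_distance`, and `cer` are hypothetical, and dropping every Unicode punctuation character plus whitespace is an assumed normalization.

```python
import unicodedata


def strip_punct(text: str) -> str:
    """Drop Unicode punctuation and whitespace (assumed normalization)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # delete a reference character
                curr[j - 1] + 1,          # insert a hypothesis character
                prev[j - 1] + (r != h),   # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, remove_punct: bool = False) -> float:
    """Edit distance divided by the number of reference characters."""
    if remove_punct:
        ref, hyp = strip_punct(ref), strip_punct(hyp)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


ref = "你好，世界！"  # reference transcript with punctuation
hyp = "你好世界"      # hypothesis without punctuation
print(cer(ref, hyp))                     # about 0.33 (2 edits / 6 characters)
print(cer(ref, hyp, remove_punct=True))  # 0.0
```

A model that gets every character right but omits punctuation is still penalized under the raw metric, which is why the raw CER is higher than the no-punctuation CER.
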

Training history:

| Step | Epoch | Eval Loss | CER (nopunct) | CER (raw) |
|---|---|---|---|---|
| 500 | 2.03 | 1.1974 | 22.90% | 27.46% |
| 1000 | 5.02 | 0.5688 | 9.22% | 13.23% |
| 1500 | 8.01 | 0.3839 | 8.75% | 11.18% |
| 2000 | 11.01 | 0.3090 | 8.39% | 10.55% |
| 2500 | 14.00 | 0.2692 | 8.10% | 9.93% |
| 3000 | 16.03 | 0.2478 | 7.98% | 9.75% |
| 3500 | 19.02 | 0.2362 | 7.85% | 9.59% |
| 4000 | 22.02 | 0.2297 | 7.79% | 9.53% |
| 4500 | 25.01 | 0.2259 | 7.81% | 9.51% |
| 5000 | 28.00 | 0.2241 | 7.79% | 9.49% |
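
The history table can be checked programmatically. The sketch below copies the rows above into a list and looks up the best step under each metric. How the best checkpoint was actually selected is not stated here, so ranking by evaluation loss or by raw CER is an assumption.

```python
# (step, epoch, eval_loss, cer_nopunct_pct, cer_raw_pct), copied from the table above
history = [
    (500, 2.03, 1.1974, 22.90, 27.46),
    (1000, 5.02, 0.5688, 9.22, 13.23),
    (1500, 8.01, 0.3839, 8.75, 11.18),
    (2000, 11.01, 0.3090, 8.39, 10.55),
    (2500, 14.00, 0.2692, 8.10, 9.93),
    (3000, 16.03, 0.2478, 7.98, 9.75),
    (3500, 19.02, 0.2362, 7.85, 9.59),
    (4000, 22.02, 0.2297, 7.79, 9.53),
    (4500, 25.01, 0.2259, 7.81, 9.51),
    (5000, 28.00, 0.2241, 7.79, 9.49),
]

# Lowest evaluation loss and lowest raw CER both occur at the last step.
best_by_loss = min(history, key=lambda row: row[2])
best_by_raw_cer = min(history, key=lambda row: row[4])
print(best_by_loss[0], best_by_raw_cer[0])  # 5000 5000

# The no-punctuation CER is tied between two checkpoints.
lowest_nopunct = min(row[3] for row in history)
tied_steps = [row[0] for row in history if row[3] == lowest_nopunct]
print(tied_steps)  # [4000, 5000]
```

Both evaluation loss and raw CER improve monotonically through step 5000, while the no-punctuation CER has already plateaued by step 4000.
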

Results by evaluation split:

| Split | CER (raw) | CER (nopunct) |
|---|---|---|
| test_yue | 9.49% | 8.07% |
| holdback_yue | 10.11% | 8.30% |

TensorBoard logs are included in the `runs/` directory of this repository.

```bash
# Clone and view locally
git clone https://huggingface.co/awong-dev/whisper-large-v3-yue-test1b-baseline-fp16-8bit
tensorboard --logdir whisper-large-v3-yue-test1b-baseline-fp16-8bit/runs
```

To transcribe an audio file:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torchaudio

model_id = "awong-dev/whisper-large-v3-yue-test1b-baseline-fp16-8bit"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Load audio, down-mix to mono, and resample to the 16 kHz that Whisper expects
audio, sr = torchaudio.load("audio.mp3")
audio = audio.mean(dim=0)  # (channels, samples) -> (samples,)
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

# Convert the waveform to log-mel input features
input_features = processor(
    audio.numpy(), sampling_rate=16000, return_tensors="pt"
).input_features

# Generate token IDs and decode them to text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```