whisper-large-v3-yue-test1-baseline

A Whisper model fine-tuned for Cantonese (yue) speech recognition.

Evaluation Results

| Metric               | Value  |
|----------------------|--------|
| CER (no punctuation) | 7.86%  |
| CER (raw)            | 9.62%  |
| Eval Loss            | 0.2437 |
| Best Step            | 4500   |
| Best Epoch           | 25.01  |

Training History

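| Step | Epoch | Eval Loss | CER (nopunct) | CER (raw) |
|------|-------|-----------|---------------|-----------|
| 500  | 2.03  | 0.9878    | 13.29%        | 18.11%    |
| 1000 | 5.02  | 0.5829    | 9.24%         | 13.33%    |
| 1500 | 8.01  | 0.4137    | 8.86%         | 11.46%    |
| 2000 | 11.01 | 0.3403    | 8.52%         | 10.71%    |
| 2500 | 14.00 | 0.2991    | 8.28%         | 10.39%    |
| 3000 | 16.03 | 0.2746    | 8.12%         | 9.96%     |
| 3500 | 19.02 | 0.2590    | 7.99%         | 9.80%     |
| 4000 | 22.02 | 0.2489    | 7.93%         | 9.71%     |
| 4500 | 25.01 | 0.2437    | 7.86%         | 9.62%     |
| 5000 | 28.00 | 0.2411    | 7.87%         | 9.64%     |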
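
The history above shows that the reported best step (4500) has the lowest CER (nopunct), while the lowest eval loss occurs at step 5000. The sketch below reproduces that comparison from the table values. That checkpoint selection was actually driven by CER (nopunct) is an assumption inferred from the numbers, not something this card states; the `history` list and variable names are illustrative.

```python
# Rows copied from the training history table:
# (step, eval_loss, cer_nopunct_percent, cer_raw_percent)
history = [
    (500, 0.9878, 13.29, 18.11),
    (1000, 0.5829, 9.24, 13.33),
    (1500, 0.4137, 8.86, 11.46),
    (2000, 0.3403, 8.52, 10.71),
    (2500, 0.2991, 8.28, 10.39),
    (3000, 0.2746, 8.12, 9.96),
    (3500, 0.2590, 7.99, 9.80),
    (4000, 0.2489, 7.93, 9.71),
    (4500, 0.2437, 7.86, 9.62),
    (5000, 0.2411, 7.87, 9.64),
]

# Assumed selection rule: pick the checkpoint with the lowest CER (nopunct).
best_by_cer = min(history, key=lambda row: row[2])

# For comparison: the checkpoint with the lowest eval loss.
best_by_loss = min(history, key=lambda row: row[1])

print("best step by CER (nopunct):", best_by_cer[0])
print("best step by eval loss:", best_by_loss[0])
```

The two criteria disagree by one evaluation interval, so the choice of selection metric matters when picking a checkpoint from this run.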

Final Evaluation

| Split        | CER (raw) | CER (nopunct) |
|--------------|-----------|---------------|
| test_yue     | 9.66%     | 8.22%         |
| holdback_yue | 10.31%    | 8.47%         |

Training Details

  • Dataset: mozilla-foundation/common_voice_17_0 (yue)
  • Language: Cantonese (yue)
  • Task: Automatic Speech Recognition (ASR)
  • Architecture: Encoder-Decoder (Seq2Seq)
  • Metric: Character Error Rate (CER)
  • Total training steps: 5310
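
For readers unfamiliar with the metric, here is a minimal sketch of how a character error rate can be computed, with and without punctuation. It is not the evaluation code used for this model: the helper names (`edit_distance`, `strip_punct`, `cer`) and the punctuation rule (drop Unicode punctuation categories and whitespace) are assumptions for illustration, and the actual normalization behind the "nopunct" numbers may differ.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def strip_punct(text: str) -> str:
    """Hypothetical normalization: drop Unicode punctuation (P*) and whitespace."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def cer(ref: str, hyp: str, ignore_punct: bool = False) -> float:
    """CER = edit distance / number of reference characters."""
    if ignore_punct:
        ref, hyp = strip_punct(ref), strip_punct(hyp)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


ref = "你好，世界。"
hyp = "你好世界"
print(f"raw CER: {cer(ref, hyp):.4f}")
print(f"nopunct CER: {cer(ref, hyp, ignore_punct=True):.4f}")
```

In this toy pair the hypothesis differs from the reference only in punctuation, so the raw CER is nonzero while the punctuation-free CER is zero. The same effect explains why the CER (raw) figures in this card are consistently higher than the CER (nopunct) figures.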

Training Metrics

TensorBoard logs are included in the runs/ directory of this repository.

# Clone and view locally
git clone https://huggingface.co/awong-dev/whisper-large-v3-yue-test1-baseline
tensorboard --logdir whisper-large-v3-yue-test1-baseline/runs

Usage

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torchaudio

processor = WhisperProcessor.from_pretrained("awong-dev/whisper-large-v3-yue-test1-baseline")
model = WhisperForConditionalGeneration.from_pretrained("awong-dev/whisper-large-v3-yue-test1-baseline")

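# Load audio; torchaudio returns a (channels, samples) tensor
audio, sr = torchaudio.load("audio.mp3")
# Downmix to mono so the later squeeze() yields a 1-D waveform even for stereo files
audio = audio.mean(dim=0, keepdim=True)
# Resample to the 16 kHz rate passed to the processor below
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)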

input_features = processor(
    audio.squeeze().numpy(), sampling_rate=16000, return_tensors="pt"
).input_features

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)