whisper-large-v3-yue-test1-baseline

Whisper large-v3 model fine-tuned for Cantonese (yue) speech recognition.

Evaluation Results

Metric	Value
CER (no punctuation)	7.86%
CER (raw)	9.62%
Eval Loss	0.2437
Best Step	4500
Best Epoch	25.01

Training History

Step	Epoch	Eval Loss	CER (nopunct)	CER (raw)
500	2.03	0.9878	13.29%	18.11%
1000	5.02	0.5829	9.24%	13.33%
1500	8.01	0.4137	8.86%	11.46%
2000	11.01	0.3403	8.52%	10.71%
2500	14.00	0.2991	8.28%	10.39%
3000	16.03	0.2746	8.12%	9.96%
3500	19.02	0.2590	7.99%	9.80%
4000	22.02	0.2489	7.93%	9.71%
4500	25.01	0.2437	7.86%	9.62%
5000	28.00	0.2411	7.87%	9.64%
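
The Best Step and Best Epoch reported under Evaluation Results match the row with the lowest CER, not the lowest eval loss, which is still falling at step 5000. The sketch below reproduces that selection from the table above. It assumes CER (nopunct) was the selection metric, which the card does not state explicitly; `history` and `best_checkpoint` are illustrative names, not part of the training code.

```python
# (step, epoch, eval_loss, cer_nopunct_pct, cer_raw_pct), copied from the table above.
history = [
    (500, 2.03, 0.9878, 13.29, 18.11),
    (1000, 5.02, 0.5829, 9.24, 13.33),
    (1500, 8.01, 0.4137, 8.86, 11.46),
    (2000, 11.01, 0.3403, 8.52, 10.71),
    (2500, 14.00, 0.2991, 8.28, 10.39),
    (3000, 16.03, 0.2746, 8.12, 9.96),
    (3500, 19.02, 0.2590, 7.99, 9.80),
    (4000, 22.02, 0.2489, 7.93, 9.71),
    (4500, 25.01, 0.2437, 7.86, 9.62),
    (5000, 28.00, 0.2411, 7.87, 9.64),
]


def best_checkpoint(rows, metric_index):
    """Return the row with the smallest value in the chosen column."""
    return min(rows, key=lambda row: row[metric_index])


best_by_cer = best_checkpoint(history, metric_index=3)   # CER (nopunct)
best_by_loss = best_checkpoint(history, metric_index=2)  # eval loss

print("best step by CER (nopunct):", best_by_cer[0])
print("best step by eval loss:", best_by_loss[0])
```

Selecting on CER picks step 4500, while selecting on eval loss would pick step 5000, so the two criteria disagree on this run.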

Final Evaluation

Split	CER (raw)	CER (nopunct)
test_yue	9.66%	8.22%
holdback_yue	10.31%	8.47%

Training Details

Dataset: mozilla-foundation/common_voice_17_0 (yue)
Language: Cantonese (yue)
Task: Automatic Speech Recognition (ASR)
Architecture: Encoder-Decoder (Seq2Seq)
Metric: Character Error Rate (CER)
Total training steps: 5310
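
CER is the character-level edit distance between the hypothesis and the reference, divided by the reference length. The sketch below shows one way the raw and no-punctuation variants could be computed. The normalization rule (dropping every Unicode punctuation character and all whitespace) is an assumption, since the card does not specify the one actually used, and `edit_distance`, `strip_punct`, and `cer` are illustrative helpers.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def strip_punct(text):
    """Drop Unicode punctuation (categories P*) and whitespace (assumed rule)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def cer(ref, hyp, ignore_punct=False):
    """Character error rate of hyp against ref, as a fraction."""
    if ignore_punct:
        ref, hyp = strip_punct(ref), strip_punct(hyp)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


ref = "你好，世界！"
hyp = "你好世界"
raw_cer = cer(ref, hyp)
nopunct_cer = cer(ref, hyp, ignore_punct=True)
print(f"CER (raw): {raw_cer:.2%}")
print(f"CER (nopunct): {nopunct_cer:.2%}")
```

In this toy pair the hypothesis only lacks the two punctuation marks, so the raw score counts two errors over six reference characters while the no-punctuation score is zero. The same effect explains why the raw CER in the tables is consistently higher than the no-punctuation CER.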

Training Metrics

TensorBoard logs are included in the runs/ directory of this repository.

# Clone and view locally
git clone https://huggingface.co/awong-dev/whisper-large-v3-yue-test1-baseline
tensorboard --logdir whisper-large-v3-yue-test1-baseline/runs

Usage

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torchaudio

model_id = "awong-dev/whisper-large-v3-yue-test1-baseline"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Load audio; torchaudio returns a (channels, samples) tensor
audio, sr = torchaudio.load("audio.mp3")
# Downmix to mono so the processor receives a single 1-D waveform
audio = audio.mean(dim=0)
# Whisper expects 16 kHz input
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

# Convert the waveform to log-mel input features
input_features = processor(
    audio.numpy(), sampling_rate=16000, return_tensors="pt"
).input_features

# Generate token IDs and decode them to text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Model size: 2B params (Safetensors)
Tensor type: F32
