whisper-large-v3-yue-test1b-baseline-fp16-8bit

Whisper large-v3 fine-tuned for Cantonese (yue) automatic speech recognition.

Evaluation Results

Metric	Value
CER (no punctuation)	7.79%
CER (raw)	9.49%
Eval Loss	0.2241
Best Step	5000
Best Epoch	28.00

Training History

Step	Epoch	Eval Loss	CER (nopunct)	CER (raw)
500	2.03	1.1974	22.90%	27.46%
1000	5.02	0.5688	9.22%	13.23%
1500	8.01	0.3839	8.75%	11.18%
2000	11.01	0.3090	8.39%	10.55%
2500	14.00	0.2692	8.10%	9.93%
3000	16.03	0.2478	7.98%	9.75%
3500	19.02	0.2362	7.85%	9.59%
4000	22.02	0.2297	7.79%	9.53%
4500	25.01	0.2259	7.81%	9.51%
5000	28.00	0.2241	7.79%	9.49%
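
The "Best Step" reported above can be reproduced from this table. The following is a minimal sketch, assuming the best checkpoint is the one with the lowest eval loss; the card does not state the selection criterion, and `best_checkpoint` is a hypothetical helper, not part of the training code.

```python
# Rows copied from the Training History table:
# (step, epoch, eval_loss, cer_nopunct_percent, cer_raw_percent)
HISTORY = [
    (500, 2.03, 1.1974, 22.90, 27.46),
    (1000, 5.02, 0.5688, 9.22, 13.23),
    (1500, 8.01, 0.3839, 8.75, 11.18),
    (2000, 11.01, 0.3090, 8.39, 10.55),
    (2500, 14.00, 0.2692, 8.10, 9.93),
    (3000, 16.03, 0.2478, 7.98, 9.75),
    (3500, 19.02, 0.2362, 7.85, 9.59),
    (4000, 22.02, 0.2297, 7.79, 9.53),
    (4500, 25.01, 0.2259, 7.81, 9.51),
    (5000, 28.00, 0.2241, 7.79, 9.49),
]


def best_checkpoint(history):
    """Return the row with the lowest eval loss (assumed selection criterion)."""
    return min(history, key=lambda row: row[2])


best = best_checkpoint(HISTORY)
print(f"best step={best[0]} epoch={best[1]:.2f} eval_loss={best[2]:.4f}")
# prints: best step=5000 epoch=28.00 eval_loss=0.2241
```

Selecting on CER (nopunct) instead would produce a tie between steps 4000 and 5000, so eval loss is the less ambiguous criterion for this run.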

Final Evaluation

Split	CER (nopunct)	CER (raw)
test_yue	8.07%	9.49%
holdback_yue	8.30%	10.11%

Training Details

Dataset: mozilla-foundation/common_voice_17_0 (yue)
Language: Cantonese (yue)
Task: Automatic Speech Recognition (ASR)
Architecture: Encoder-Decoder (Seq2Seq)
Metric: Character Error Rate (CER)
Total training steps: 5310
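
For readers unfamiliar with the metric, CER is the character-level edit distance between reference and hypothesis, divided by the reference length. The sketch below shows one way the "raw" and "no punctuation" variants could be computed. The normalization is an assumption (Unicode punctuation and whitespace stripped for the no-punctuation variant); the card does not describe the evaluation script, and the function names are illustrative.

```python
import unicodedata


def strip_punctuation(text):
    """Drop Unicode punctuation (categories P*) and whitespace. Assumed normalization."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp, remove_punct=False):
    """Character error rate: edit distance divided by reference length."""
    if remove_punct:
        ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


reference = "你好，世界。"
hypothesis = "你好世界"
print(cer(reference, hypothesis))                     # raw: 2 edits over 6 characters
print(cer(reference, hypothesis, remove_punct=True))  # punctuation ignored: 0.0
```

The example illustrates why the raw figure is higher than the no-punctuation figure: a transcript with correct characters but missing punctuation is still penalized in the raw score.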

Training Metrics

TensorBoard logs are included in the runs/ directory of this repository.

# Clone and view locally
git clone https://huggingface.co/awong-dev/whisper-large-v3-yue-test1b-baseline-fp16-8bit
tensorboard --logdir whisper-large-v3-yue-test1b-baseline-fp16-8bit/runs

Usage

import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

MODEL_ID = "awong-dev/whisper-large-v3-yue-test1b-baseline-fp16-8bit"

processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)

# Load audio; torchaudio returns a (channels, samples) tensor
audio, sr = torchaudio.load("audio.mp3")

# Downmix to mono so the processor receives a 1-D waveform (stereo files would otherwise fail)
audio = audio.mean(dim=0)

# Whisper expects 16 kHz input
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

# Convert the waveform to log-mel input features
input_features = processor(
    audio.numpy(), sampling_rate=16000, return_tensors="pt"
).input_features

# Generate token IDs and decode them to text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
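
The snippet above transcribes a single window. Whisper operates on 30-second windows, and the feature extractor pads or truncates input to that length by default, so longer recordings need to be split first. The following is a minimal sketch of fixed-size splitting; `chunk_bounds` is a hypothetical helper, not part of `transformers`.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30.0):
    """Return (start, end) sample indices covering the audio in fixed-size windows."""
    if num_samples < 0 or sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("num_samples must be >= 0; sample_rate and chunk_seconds must be > 0")
    chunk = int(sample_rate * chunk_seconds)
    return [
        (start, min(start + chunk, num_samples))
        for start in range(0, num_samples, chunk)
    ]


# A 70-second recording at 16 kHz splits into two full windows and one partial window.
print(chunk_bounds(70 * 16000))
# prints: [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Each slice `audio[start:end]` can then be passed through the processor and `model.generate` as in the usage snippet, and the resulting texts concatenated. Fixed-size cuts may split a word at a boundary, so overlapping windows or silence-based splitting may give cleaner results.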

Safetensors

Model size: 2B params
Tensor type: F32
