whisper-large-v3-yue-test1b-baseline-fp16-8bit

Fine-tuned model for Cantonese (yue) speech recognition.

Evaluation Results

Metric                 Value
CER (no punctuation)   7.79%
CER (raw)              9.49%
Eval Loss              0.2241
Best Step              5000
Best Epoch             28.00

Training History

Step   Epoch   Eval Loss   CER (nopunct)   CER (raw)
500    2.03    1.1974      22.90%          27.46%
1000   5.02    0.5688      9.22%           13.23%
1500   8.01    0.3839      8.75%           11.18%
2000   11.01   0.3090      8.39%           10.55%
2500   14.00   0.2692      8.10%           9.93%
3000   16.03   0.2478      7.98%           9.75%
3500   19.02   0.2362      7.85%           9.59%
4000   22.02   0.2297      7.79%           9.53%
4500   25.01   0.2259      7.81%           9.51%
5000   28.00   0.2241      7.79%           9.49%

Final Evaluation

Split          CER (raw)   CER (nopunct)
test_yue       9.49%       8.07%
holdback_yue   10.11%      8.30%

Training Details

  • Dataset: mozilla-foundation/common_voice_17_0 (yue)
  • Language: Cantonese (yue)
  • Task: Automatic Speech Recognition (ASR)
  • Architecture: Encoder-Decoder (Seq2Seq)
  • Metric: Character Error Rate (CER)
  • Total training steps: 5310
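
For readers unfamiliar with the metric, the sketch below shows one common way to compute Character Error Rate: the character-level edit distance between reference and hypothesis, summed over all utterances and divided by the total number of reference characters. It is only an illustration. The function names (edit_distance, strip_punctuation, cer) are hypothetical, and treating "no punctuation" as "drop every character in a Unicode punctuation category" is an assumption; the exact normalization behind the numbers reported in this card is not documented here.

```python
import unicodedata


def strip_punctuation(text):
    """Drop characters whose Unicode category is punctuation (P*).

    Assumption: this approximates the "nopunct" variant; the card does not
    specify the normalization actually used.
    """
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses, strip_punct=False):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        if strip_punct:
            ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars


if __name__ == "__main__":
    refs = ["你好，世界。"]
    hyps = ["你好世界"]
    print(f"CER (raw):     {cer(refs, hyps):.2%}")
    print(f"CER (nopunct): {cer(refs, hyps, strip_punct=True):.2%}")
```

In this toy example the hypothesis differs from the reference only in punctuation, so the raw score is nonzero while the punctuation-stripped score is zero. This is the same direction as the gap between the two CER columns reported above.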

Training Metrics

TensorBoard logs are included in the runs/ directory of this repository.

# Clone and view locally
git clone https://huggingface.co/awong-dev/whisper-large-v3-yue-test1b-baseline-fp16-8bit
tensorboard --logdir whisper-large-v3-yue-test1b-baseline-fp16-8bit/runs

Usage

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torchaudio

processor = WhisperProcessor.from_pretrained("awong-dev/whisper-large-v3-yue-test1b-baseline-fp16-8bit")
model = WhisperForConditionalGeneration.from_pretrained("awong-dev/whisper-large-v3-yue-test1b-baseline-fp16-8bit")

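# Load audio as a [channels, samples] tensor and downmix to mono
audio, sr = torchaudio.load("audio.mp3")
audio = audio.mean(dim=0)
# Whisper expects 16 kHz input
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

# The processor pads or truncates each input to a single 30-second window
input_features = processor(
    audio.numpy(), sampling_rate=16000, return_tensors="pt"
).input_features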

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
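
The usage example above processes a single window: the Whisper feature extractor works on 30 seconds of audio at a time, so a longer recording has to be split before transcription. The helper below is a minimal sketch of that splitting step, computing (start, end) sample indices for consecutive windows with an optional overlap. The function name chunk_bounds and its parameters are hypothetical and are not part of this repository or of transformers. Simply concatenating per-chunk transcriptions may cut words at chunk boundaries.

```python
def chunk_bounds(num_samples, sample_rate=16000, window_s=30.0, overlap_s=0.0):
    """Return (start, end) sample indices covering num_samples in fixed windows.

    Hypothetical helper: window_s matches Whisper's 30-second input window,
    and overlap_s is an optional overlap between consecutive windows.
    """
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    if window <= 0 or step <= 0:
        raise ValueError("window must be positive and larger than the overlap")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the audio
        start += step
    return bounds


if __name__ == "__main__":
    # A 70-second mono recording at 16 kHz
    for start, end in chunk_bounds(70 * 16000):
        print(start, end)
```

Each (start, end) pair can be used to slice the mono waveform, for example audio[start:end], before passing the slice to the processor and model.generate as in the example above.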