Model Card: shooding/taiwan-breeze-asr-26

Overview

A CTranslate2 float16 release of a LoRA fine-tune of MediaTek-Research/Breeze-ASR-26, itself a Whisper-large-v2 derivative for Taiwanese Hokkien (Taigi). The model was fine-tuned with Unsloth on adi-gov-tw/Taiwan-Tongues-ASR-CE-dataset-zhtw plus optional user-provided Taigi recordings, then merged and converted to CTranslate2 for use with faster-whisper.

Companion Mandarin repo with the same training recipe and release format: shooding/faster-whisper-large-v3-zh-TW.

Model Details

Uses

Direct Use

Transcribing Taiwanese Hokkien audio into Traditional Chinese characters (漢字, not Tâi-lô) with the faster-whisper library, in real-time or batch pipelines.

Downstream Use

Taigi voice assistants, Taigi subtitle generation, Taigi↔Mandarin code-switching transcription, low-latency on-device inference (via CT2 int8/int8_float16 quantization).

Out-of-Scope Use

  • Tâi-lô romanization output (model emits Han characters only)
  • Other Sinitic languages (Cantonese, Hakka, Min-dong)
  • Languages other than Taigi, Mandarin, and code-switched English

Getting Started

GPU (Recommended)

from faster_whisper import WhisperModel

model = WhisperModel(
    'shooding/taiwan-breeze-asr-26',
    device='cuda',
    compute_type='float16',
)

segments, info = model.transcribe('taigi_clip.wav', language='zh', task='transcribe')
for seg in segments:
    print(f'[{seg.start:.2f}s → {seg.end:.2f}s] {seg.text}')

CPU (int8 quantization)

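from faster_whisper import WhisperModel

# Weights are quantized to int8 at load time; call transcribe() as in the GPU example.
model = WhisperModel(
    'shooding/taiwan-breeze-asr-26',
    device='cpu',
    compute_type='int8',
)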
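
The two snippets above differ only in `device` and `compute_type`. The sketch below makes that choice explicit, and adds the `int8_float16` GPU mode mentioned under Downstream Use. `pick_compute_type` and `load_model` are hypothetical helper names, not part of faster-whisper, and the selection policy is an assumption for illustration.

```python
def pick_compute_type(device: str, low_memory: bool = False) -> str:
    """Return a CTranslate2 compute type for the given device (illustrative policy)."""
    if device == "cuda":
        # float16 matches the released weights; int8_float16 reduces memory use.
        return "int8_float16" if low_memory else "float16"
    if device == "cpu":
        return "int8"
    raise ValueError(f"unsupported device: {device!r}")


def load_model(device: str = "cuda", low_memory: bool = False):
    """Load the released model with a compute type chosen by pick_compute_type."""
    # Imported lazily so the helper above works without faster-whisper installed.
    from faster_whisper import WhisperModel

    return WhisperModel(
        "shooding/taiwan-breeze-asr-26",
        device=device,
        compute_type=pick_compute_type(device, low_memory),
    )
```

For example, `load_model("cuda", low_memory=True)` would request `int8_float16`, which may suit smaller GPUs.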

Training Details

Training Data

  • Primary (regularizer): adi-gov-tw/Taiwan-Tongues-ASR-CE-dataset-zhtw (Mandarin with English code-switching), streaming
  • Secondary (target): user-provided Taigi recordings via audiofolder
  • The two sources are interleaved, with CUSTOM_PROB controlling Taigi exposure (default 0.0625; over 2000 steps at an effective batch of 16 this is ≈ 10 epochs over 200 Taigi clips)
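
The "≈ 10 epochs" figure can be checked with a few lines of arithmetic. This is a sketch that assumes a single device and that each training sample is drawn from the Taigi set with probability CUSTOM_PROB; `taigi_epochs` is a hypothetical helper name, and the step and batch values come from the Training Hyperparameters section.

```python
def taigi_epochs(max_steps, per_device_batch, grad_accum, custom_prob, n_taigi_clips):
    """Expected number of passes over the Taigi clips during training."""
    samples_seen = max_steps * per_device_batch * grad_accum  # total samples drawn
    taigi_samples = samples_seen * custom_prob                # expected Taigi draws
    return taigi_samples / n_taigi_clips


# 2000 steps x 16 samples/step = 32000 samples; 6.25% of them = 2000 Taigi draws.
epochs = taigi_epochs(2000, 4, 4, 0.0625, 200)
print(epochs)  # 10.0
```

With a different number of Taigi clips, CUSTOM_PROB can be rescaled to keep the expected number of passes roughly constant.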

Build Pipeline

  1. Load MediaTek-Research/Breeze-ASR-26 via unsloth.FastModel
  2. Apply LoRA adapters (r=64, α=64, target: q_proj, v_proj)
  3. Set generation_config: language=zh, task=transcribe
  4. Fine-tune with Seq2SeqTrainer on the interleaved stream
  5. Merge LoRA → full fp16 HF model
  6. Convert to CTranslate2 with float16 quantization (ct2-transformers-converter)
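
Step 6 can also be run from Python instead of the `ct2-transformers-converter` command line. The following is a minimal sketch, not the notebook's actual code: `convert_to_ct2` is a hypothetical wrapper, the directory arguments are placeholders, and it assumes the merged model directory contains `tokenizer.json` and `preprocessor_config.json`.

```python
def convert_to_ct2(merged_dir: str, output_dir: str, quantization: str = "float16") -> str:
    """Convert a merged Hugging Face Whisper checkpoint to CTranslate2 format."""
    # Imported lazily so this function can be defined without ctranslate2 installed.
    from ctranslate2.converters import TransformersConverter

    converter = TransformersConverter(
        merged_dir,
        # Files copied alongside the converted weights (assumed to exist in merged_dir).
        copy_files=["tokenizer.json", "preprocessor_config.json"],
    )
    # force=True overwrites output_dir if it already exists.
    return converter.convert(output_dir, quantization=quantization, force=True)
```

The returned path is the output directory, which can then be passed to `WhisperModel` in place of the repository name.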

LoRA Configuration

| Parameter | Value |
|---|---|
| r | 64 |
| lora_alpha | 64 |
| target_modules | q_proj, v_proj |
| lora_dropout | 0 |
| bias | none |
| task_type | None (required for Whisper) |
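
Expressed as PEFT arguments, the configuration above might look like the sketch below. The training notebook applies adapters through Unsloth, so this plain `peft.LoraConfig` form is an assumed equivalent rather than the notebook's exact call; `LORA_KWARGS` and `build_lora_config` are hypothetical names.

```python
LORA_KWARGS = {
    "r": 64,
    "lora_alpha": 64,
    "target_modules": ["q_proj", "v_proj"],
    "lora_dropout": 0.0,
    "bias": "none",
    "task_type": None,  # left unset for Whisper, per the configuration above
}

# Standard LoRA scales the adapter update by lora_alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(scaling)  # 1.0


def build_lora_config():
    """Build a peft.LoraConfig from the values above (requires peft)."""
    from peft import LoraConfig  # lazy import: peft is only needed when this is called

    return LoraConfig(**LORA_KWARGS)
```

With r equal to lora_alpha, the adapter update is applied at a scale of 1.0.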

Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| max_steps | 2000 |
| per_device_train_batch_size | 4 |
| gradient_accumulation_steps | 4 (effective batch = 16) |
| learning_rate | 1e-4 |
| warmup_steps | 100 |
| lr_scheduler_type | cosine |
| optimizer | adamw_8bit (Unsloth) |
| weight_decay | 0.001 |
| eval_steps / save_steps | 200 |
| best model metric | CER (lower is better) |
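
To make the warmup and cosine settings concrete, the sketch below computes the learning rate at a given step. It assumes the common shape of a linear warmup from zero followed by a half-cosine decay to zero; the scheduler actually used in training may differ in detail, and `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step, base_lr=1e-4, warmup_steps=100, max_steps=2000):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the rate at the start, mid-warmup, peak, decay midpoint, and final step.
for step in (0, 50, 100, 1050, 2000):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the rate peaks at 1e-4 at step 100 and has fallen to half of that by step 1050.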

Hardware & Environment

  • Hardware: Google Colab (T4 / A100 class)
  • LoRA Efficiency: ~2% of parameters trained, 50%+ VRAM reduction vs. full fine-tuning

Evaluation

Testing Data

Held-out split of adi-gov-tw/Taiwan-Tongues-ASR-CE-dataset-zhtw (200 samples). This eval is a Mandarin retention signal, not a Taigi quality signal; for target-language CER, run separate inference on a Taigi benchmark (e.g. the Breeze Taigi test set).

Metrics

CER (Character Error Rate): edit distance / reference length.
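
A minimal pure-Python sketch of this definition follows. It is an illustration only, not the notebook's metric code: it applies no text normalization, and the example sentence pair is made up.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted character out of five reference characters.
print(cer("我欲去台北", "我要去台北"))  # 0.2
```

Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis contains many insertions.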

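During training the notebook disables predict_with_generate and derives CER from the argmax of teacher-forced logits. Because each prediction is conditioned on the reference prefix rather than on the model's own output, the reported values are typically optimistic (lower than CER under free-running decoding) and should be read only as a trend signal alongside validation loss.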
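
A toy example shows why teacher-forced scores are not comparable with decoding-time CER. The lookup table below stands in for a model and is entirely hypothetical; it makes one mistake, which stays isolated under teacher forcing but propagates when the model consumes its own output.

```python
# Toy next-token table: maps the previous token to the predicted next token.
# It makes a single mistake: after "A" it predicts "X" instead of "B".
NEXT = {"<s>": "A", "A": "X", "B": "C", "C": "D", "X": "Y", "Y": "Z"}
REFERENCE = ["A", "B", "C", "D"]


def teacher_forced(reference):
    """Predict each token from the reference prefix, as argmax over training logits does."""
    inputs = ["<s>"] + reference[:-1]
    return [NEXT[tok] for tok in inputs]


def free_running(length):
    """Predict each token from the model's own previous output, as generation does."""
    out, prev = [], "<s>"
    for _ in range(length):
        prev = NEXT[prev]
        out.append(prev)
    return out


def errors(pred, ref):
    """Count positions where prediction and reference disagree."""
    return sum(p != r for p, r in zip(pred, ref))


tf = teacher_forced(REFERENCE)     # ['A', 'X', 'C', 'D']
fr = free_running(len(REFERENCE))  # ['A', 'X', 'Y', 'Z']
print(errors(tf, REFERENCE), errors(fr, REFERENCE))  # 1 3
```

The same single modelling error costs one token under teacher forcing and three under free-running decoding, which is why only the trend of the training-time CER is meaningful.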

Technical Specifications

  • Architecture: Whisper large-v2 encoder-decoder transformer (~1.5B params), Breeze-ASR-26 pre-adapted
  • Format: CTranslate2, float16 (quantizable to int8 at load time)
  • Software Stack: Unsloth, Transformers 4.56.2, TRL 0.22.2, PEFT, CTranslate2, faster-whisper

Citation

@misc{shooding2026taiwanbreezeasr26,
  author       = {shooding},
  title        = {taiwan-breeze-asr-26: CTranslate2 LoRA fine-tune of Breeze-ASR-26 for Taiwanese Hokkien},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/shooding/taiwan-breeze-asr-26}},
}

Acknowledgements
