Model Card: shooding/taiwan-breeze-asr-26

Overview

A CTranslate2 float16 build of a LoRA fine-tune of MediaTek-Research/Breeze-ASR-26 (itself a Whisper-large-v2 derivative for Taiwanese Hokkien / Taigi). The adapters were trained with Unsloth on adi-gov-tw/Taiwan-Tongues-ASR-CE-dataset-zhtw plus optional user-provided Taigi recordings, then merged into the base model and converted to CTranslate2 for use with faster-whisper.

Companion Mandarin repo with the same training recipe and release format: shooding/faster-whisper-large-v3-zh-TW.

Model Details

Model Type: Encoder-decoder speech transformer (Whisper architecture), CTranslate2 format
Language(s): Taiwanese Hokkien (Taigi, ISO 639-3: nan) with Han-character (漢字) output; Mandarin code-switching retained
Developed by: shooding
Fine-tuned from: MediaTek-Research/Breeze-ASR-26
License: Apache 2.0
Repository: https://github.com/shooding/taiwan-finetune

Uses

Direct Use

Transcribing Taiwanese Hokkien audio into Traditional Chinese characters (漢字, not Tâi-lô) with the faster-whisper library, in real-time or batch pipelines.

Downstream Use

Taigi voice assistants, Taigi subtitle generation, Taigi↔Mandarin code-switching transcription, low-latency on-device inference (via CT2 int8/int8_float16 quantization).
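
The quantization modes named above are chosen at load time through the compute_type argument of faster-whisper. A minimal sketch, assuming a hypothetical helper pick_compute_type (not part of this repository) that prefers int8_float16 on a memory-constrained GPU and int8 on CPU:

```python
def pick_compute_type(device: str, low_memory: bool = False) -> str:
    """Map a device to a CTranslate2 compute type (hypothetical helper)."""
    if device == "cuda":
        # int8_float16 keeps weights in int8 and runs the rest in float16.
        return "int8_float16" if low_memory else "float16"
    # CPU path: int8 weights, as in the CPU example of this card.
    return "int8"


def load_model(device: str = "cuda", low_memory: bool = False):
    # Imported lazily so pick_compute_type works without faster-whisper installed.
    from faster_whisper import WhisperModel

    return WhisperModel(
        "shooding/taiwan-breeze-asr-26",
        device=device,
        compute_type=pick_compute_type(device, low_memory),
    )
```

Calling load_model("cuda", low_memory=True) would then load the same float16 release with int8 weights; accuracy under each mode is not reported in this card and should be checked on your own audio.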

Out-of-Scope Use

Tâi-lô romanization output (model emits Han characters only)
Other Sinitic languages (Cantonese, Hakka, Min-dong)
Languages other than Taigi, Mandarin, and code-switched English

Getting Started

GPU (Recommended)

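from faster_whisper import WhisperModel

# Load the CTranslate2 weights from the Hugging Face Hub in float16 on the GPU.
model = WhisperModel(
    'shooding/taiwan-breeze-asr-26',
    device='cuda',
    compute_type='float16',
)

# language='zh' matches the generation_config set during fine-tuning.
segments, info = model.transcribe('taigi_clip.wav', language='zh', task='transcribe')
# segments is a generator: decoding happens as the loop consumes it.
for seg in segments:
    print(f'[{seg.start:.2f}s → {seg.end:.2f}s] {seg.text}')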

CPU (int8 quantization)

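from faster_whisper import WhisperModel

# Quantize the float16 weights to int8 at load time for CPU inference.
model = WhisperModel(
    'shooding/taiwan-breeze-asr-26',
    device='cpu',
    compute_type='int8',
)
segments, info = model.transcribe('taigi_clip.wav', language='zh', task='transcribe')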

Training Details

Training Data

Primary (regularizer): adi-gov-tw/Taiwan-Tongues-ASR-CE-dataset-zhtw (Mandarin with English code-switching), loaded in streaming mode
Secondary (target): user-provided Taigi recordings, loaded via audiofolder
The two sources are interleaved, with CUSTOM_PROB controlling Taigi exposure (default 0.0625 ≈ 10 epochs over 200 Taigi clips)
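
The "≈ 10 epochs" figure can be checked with a few lines of arithmetic. This sketch assumes that CUSTOM_PROB is the per-sample probability of drawing from the Taigi set, that training runs on a single device, and that the batch and step values are the defaults listed under Training Hyperparameters; the function name is illustrative only.

```python
# Defaults taken from this card.
CUSTOM_PROB = 0.0625      # assumed: probability that a drawn sample is Taigi
MAX_STEPS = 2000
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 4
NUM_TAIGI_CLIPS = 200     # example Taigi set size quoted in the card


def expected_taigi_epochs(prob, max_steps, batch, accum, num_clips):
    """Expected number of passes over the Taigi clips during training."""
    samples_seen = max_steps * batch * accum   # 2000 * 16 = 32000
    taigi_samples = samples_seen * prob        # 32000 * 0.0625 = 2000
    return taigi_samples / num_clips           # 2000 / 200 = 10


epochs = expected_taigi_epochs(
    CUSTOM_PROB, MAX_STEPS, PER_DEVICE_BATCH, GRAD_ACCUM, NUM_TAIGI_CLIPS
)
print(epochs)  # 10.0
```

With a larger or smaller Taigi set, the same function suggests how CUSTOM_PROB might be rescaled to keep the number of passes roughly constant.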

Build Pipeline

Load MediaTek-Research/Breeze-ASR-26 via unsloth.FastModel
Apply LoRA adapters (r=64, α=64, target: q_proj, v_proj)
Set generation_config: language=zh, task=transcribe
Fine-tune with Seq2SeqTrainer on the interleaved stream
Merge LoRA → full fp16 HF model
Convert to CTranslate2 with float16 quantization (ct2-transformers-converter)
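
The final conversion step can be sketched as a call to the ct2-transformers-converter command-line tool. The directory names below are hypothetical placeholders, and the list of copied files assumes the merged checkpoint directory contains tokenizer.json and preprocessor_config.json.

```python
import shlex


def build_converter_cmd(merged_dir: str, output_dir: str) -> list[str]:
    """Build the ct2-transformers-converter call for a merged fp16 checkpoint."""
    return [
        "ct2-transformers-converter",
        "--model", merged_dir,
        "--output_dir", output_dir,
        "--quantization", "float16",
        # Copy tokenizer / feature-extractor files next to the converted weights.
        "--copy_files", "tokenizer.json", "preprocessor_config.json",
    ]


# "merged_fp16" and "taiwan-breeze-asr-26-ct2" are placeholder paths.
cmd = build_converter_cmd("merged_fp16", "taiwan-breeze-asr-26-ct2")
print(shlex.join(cmd))
# To run the conversion: subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes the command easy to log or inspect before a long conversion is started.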

LoRA Configuration

Parameter	Value
r	64
lora_alpha	64
target_modules	q_proj, v_proj
lora_dropout	0
bias	none
task_type	None (required for Whisper)
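
For readers who work with PEFT directly, the table above corresponds roughly to the following keyword arguments. This is a sketch rather than the notebook's code: the card applies adapters through Unsloth, and build_lora_config is a hypothetical helper that passes the same values to PEFT's LoraConfig.

```python
# Keyword arguments mirroring the LoRA configuration table.
LORA_KWARGS = {
    "r": 64,
    "lora_alpha": 64,
    "target_modules": ["q_proj", "v_proj"],
    "lora_dropout": 0.0,
    "bias": "none",
    "task_type": None,   # the card notes this is required for Whisper
}

# LoRA scales the low-rank update by lora_alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(scaling)  # 1.0


def build_lora_config():
    # Imported lazily so the dictionary above is usable without PEFT installed.
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)
```

Because r and lora_alpha are equal, the adapter update is applied with a scale of 1.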

Training Hyperparameters

Hyperparameter	Value
max_steps	2000
per_device_train_batch_size	4
gradient_accumulation_steps	4 (effective batch = 16)
learning_rate	1e-4
warmup_steps	100
lr_scheduler_type	cosine
optimizer	adamw_8bit (Unsloth)
weight_decay	0.001
eval_steps / save_steps	200
best model metric	CER (lower is better)
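
To make the schedule concrete, the sketch below evaluates the learning rate at a few steps. It assumes linear warmup followed by a single half-cosine decay to zero, which is the usual behaviour of a cosine scheduler with warmup; lr_at is an illustrative name, not a function from the training notebook.

```python
import math

LEARNING_RATE = 1e-4
WARMUP_STEPS = 100
MAX_STEPS = 2000


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step (linear warmup, then cosine decay)."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 1050, 2000):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the rate peaks at 1e-4 when warmup ends at step 100 and falls to half of that midway through the decay phase, at step 1050.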

Hardware & Environment

Hardware: Google Colab (T4 / A100 class)
LoRA Efficiency: ~2% of parameters trained, 50%+ VRAM reduction vs. full fine-tuning
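
The "~2%" figure is consistent with a back-of-the-envelope count of adapter weights. The sketch below assumes the standard Whisper large-v2 shape (hidden size 1280, 32 encoder and 32 decoder layers, roughly 1.55B parameters) and counts only the q_proj and v_proj adapters listed in the LoRA configuration.

```python
# Whisper large-v2 dimensions (public architecture facts, not read from this repo).
D_MODEL = 1280
ENCODER_LAYERS = 32
DECODER_LAYERS = 32
TOTAL_PARAMS = 1.55e9   # approximate size of Whisper large-v2
R = 64                  # LoRA rank from the configuration table

# Encoder layers have one attention block (self-attention);
# decoder layers have two (self-attention and cross-attention).
attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2
adapted_modules = attention_blocks * 2   # q_proj and v_proj in each block

# Each adapted d x d projection gains two factors: A (r x d) and B (d x r).
params_per_module = R * (D_MODEL + D_MODEL)
lora_params = adapted_modules * params_per_module

print(lora_params)                                    # 31457280
print(f"{100 * lora_params / TOTAL_PARAMS:.1f}%")     # 2.0%
```

The VRAM saving depends on batch size, sequence length, and optimizer state, so it cannot be derived from the parameter count alone.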

Evaluation

Testing Data

Held-out split of adi-gov-tw/Taiwan-Tongues-ASR-CE-dataset-zhtw (200 samples). This evaluation measures Mandarin retention, not Taigi quality. For target-language CER, run separate inference on a Taigi benchmark (e.g. the Breeze Taigi test set).

Metrics

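CER (Character Error Rate): character-level edit distance (substitutions, deletions, and insertions) between hypothesis and reference, divided by the number of reference characters.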
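
A minimal, dependency-free sketch of this metric is shown below. It is not the notebook's implementation (which may rely on an evaluation library); the function names and the sample strings are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference character
                curr[j - 1] + 1,         # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution (or match at cost 0)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One character missing from a seven-character reference.
print(cer("今仔日天氣真好", "今仔日天氣好"))
```

Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis contains many insertions.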

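During training, the notebook disables predict_with_generate and derives CER from the argmax of teacher-forced logits rather than from autoregressive decoding. The reported values are therefore inflated and should be read only as a trend signal (whether CER falls from checkpoint to checkpoint) alongside validation loss, not as absolute accuracy.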
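
To illustrate what "teacher-forced logits argmax" means, the toy sketch below picks the highest-scoring token at each position and drops padded positions. It assumes the common convention of masking padding labels with -100; the function name and data are hypothetical and stand in for the notebook's metric code.

```python
IGNORE_INDEX = -100  # label value conventionally used to mask padded positions


def argmax_predictions(logits, labels):
    """Per-position argmax over the vocabulary, keeping only unmasked positions.

    logits: list of per-position score lists (seq_len x vocab_size)
    labels: list of reference token ids (seq_len), IGNORE_INDEX where padded
    """
    pred_ids, ref_ids = [], []
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        pred_ids.append(max(range(len(scores)), key=scores.__getitem__))
        ref_ids.append(label)
    return pred_ids, ref_ids


# Toy example: vocabulary of 4 tokens, 3 real positions plus 1 padded position.
logits = [
    [0.1, 2.0, 0.3, 0.0],   # argmax -> 1
    [1.5, 0.2, 0.1, 0.0],   # argmax -> 0
    [0.0, 0.1, 0.2, 3.0],   # argmax -> 3
    [9.0, 0.0, 0.0, 0.0],   # padded position, ignored
]
labels = [1, 2, 3, IGNORE_INDEX]

pred_ids, ref_ids = argmax_predictions(logits, labels)
print(pred_ids, ref_ids)  # [1, 0, 3] [1, 2, 3]
```

Each prediction here is conditioned on the reference prefix, so the predicted token sequence always has the same length as the reference. Autoregressive decoding, by contrast, conditions on the model's own earlier outputs, which is why the two kinds of CER are not directly comparable.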

Technical Specifications

Architecture: Whisper large-v2 encoder-decoder transformer (~1.5B params), initialized from the Breeze-ASR-26 weights
Format: CTranslate2, float16 (quantizable to int8 at load time)
Software Stack: Unsloth, Transformers 4.56.2, TRL 0.22.2, PEFT, CTranslate2, faster-whisper

Citation

@misc{shooding2026taiwanbreezeasr26,
  author       = {shooding},
  title        = {taiwan-breeze-asr-26: CTranslate2 LoRA fine-tune of Breeze-ASR-26 for Taiwanese Hokkien},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/shooding/taiwan-breeze-asr-26}},
}

Acknowledgements

MediaTek Research for Breeze-ASR-26
adi-gov-tw for the Taiwan Tongues ASR CE dataset
Unsloth for the LoRA/training stack
