TEA-ASR-1.1-mini · Taiwan Everyday Audio 🍵

TEA-ASR is an open, drop-in speech-recognition model purpose-built for Taiwan Mandarin. It turns real speech into natural Traditional Chinese with authentic Taiwan vocabulary, and it stays robust through the everyday Mandarin–English code-switching common in Taiwan. Adapted from the state-of-the-art Qwen3-ASR foundation and merged into a single self-contained checkpoint, TEA-ASR loads and runs exactly like stock Qwen3-ASR — no converters, no post-processing.

TEA-ASR-1.1-mini is the second-generation compact model (780M parameters). Compared with TEA-ASR-1-mini, this release substantially improves code-switching: the ASCEND and CSZS error rates drop by 0.90 and 0.66 points absolute, while CommonVoice and lecture recognition stay within about a tenth of a point (+0.04 and +0.11). If your audio mixes Mandarin and English (meetings, lectures, tech talk, everyday Taiwan speech), 1.1-mini is the recommended model.

What's new in 1.1

🔀 Code-switch leap: ASCEND 12.49 → 11.59, CSZS 13.21 → 12.55. Trained with an English-span-preservation data balance so embedded English is transcribed, not translated.
🍵 TaiMECS in the loop: the training data now includes TaiMECS (Taiwan Mandarin–English code-switching sentences, CC-BY-4.0), with hand-verified Taiwan readings.
🔬 Error-analysis-driven data: targeted real-speech supplements (acronym conventions, inanimate-pronoun 它, numeral formatting), mined from public corpora, fixed the exact regression buckets found by transcript-level error analysis.
🪶 Still <10 hours of training speech, LoRA-adapted and merged into a stock checkpoint.

Quick start

pip install qwen-asr

from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained("JacobLinCool/TEA-ASR-1.1-mini")
result = model.transcribe(audio="utterance.wav", language="Chinese")[0]
print(result.text)   # -> Traditional Chinese with Taiwan lexicon

Set language="Chinese" for Taiwan speech (recommended). You can also pass a context= string of hotwords (names, jargon) for contextual biasing, exactly as with the base Qwen3-ASR.
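
As a minimal sketch of the contextual-biasing option mentioned above, the snippet below builds a context string from a list of hotwords and passes it to transcribe. The helper names build_context and transcribe_with_hotwords, the comma-separated layout, and the sample hotwords are assumptions made for illustration; only from_pretrained, transcribe, language= and context= come from this card.

```python
from typing import Iterable


def build_context(hotwords: Iterable[str]) -> str:
    """Join hotwords into one context string, dropping blanks and duplicates.

    The comma-separated layout is an assumption made for this sketch, not a
    format required by Qwen3-ASR.
    """
    seen = []
    for word in hotwords:
        word = word.strip()
        if word and word not in seen:
            seen.append(word)
    return ", ".join(seen)


def transcribe_with_hotwords(audio_path: str, hotwords: Iterable[str]) -> str:
    """Transcribe one file with contextual biasing (downloads the model)."""
    from qwen_asr import Qwen3ASRModel  # imported lazily: heavy dependency

    model = Qwen3ASRModel.from_pretrained("JacobLinCool/TEA-ASR-1.1-mini")
    result = model.transcribe(
        audio=audio_path,
        language="Chinese",
        context=build_context(hotwords),
    )[0]
    return result.text


# Building the context string needs no model, so it can be checked on its own.
context = build_context(["台積電", "LoRA", " Kubernetes ", "LoRA", ""])
print(context)
```

Calling transcribe_with_hotwords("utterance.wav", ["台積電", "LoRA"]) would then run the same pipeline as the quick start, with the hotwords supplied as context.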

Benchmark results

Mixed Error Rate (MER%, lower is better). All numbers come from a single self-measured run under one protocol, the content-fair fold published by Breeze-ASR-25: reference and hypothesis are both converted with OpenCC t2s, punctuation is stripped, English is lowercased, and text is tokenized as CJK characters plus English words. The first column is this model.

Benchmark	TEA-ASR-1.1-mini	TEA-ASR-1-mini	Qwen3-ASR-0.6B	Breeze-ASR-25	Whisper-large-v3
CommonVoice 19 (zh-TW)	5.18	5.14	5.79	8.03	10.17
ASCEND (zh-en)	11.59	12.49	12.54	17.53	19.61
CSZS (zh-en)	12.55	13.21	16.03	12.18	23.24
NTUML2021	7.48	7.37	11.03	7.50	9.68

How to read this. 1.1-mini has the lowest ASCEND error of all systems listed and leads the 0.6B-class systems on CSZS, where only Breeze-ASR-25 scores lower (12.18 vs. 12.55). CommonVoice and NTUML2021 stay within 0.04 and 0.11 points of TEA-ASR-1-mini. The metric folds away script differences, so it does not reflect the decisive practical change versus the base model: TEA-ASR emits Traditional script and Taiwan vocabulary natively, whereas stock Qwen3-ASR produces Simplified script.
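
The deltas quoted in this card can be re-derived from the table. The short sketch below does that arithmetic for 1.1-mini versus 1-mini; the dictionary layout and the deltas helper are illustrative names, and the numbers are copied from the table above.

```python
# MER% values copied from the benchmark table above.
mer = {
    "CommonVoice 19 (zh-TW)": {"1.1-mini": 5.18, "1-mini": 5.14},
    "ASCEND (zh-en)": {"1.1-mini": 11.59, "1-mini": 12.49},
    "CSZS (zh-en)": {"1.1-mini": 12.55, "1-mini": 13.21},
    "NTUML2021": {"1.1-mini": 7.48, "1-mini": 7.37},
}


def deltas(scores):
    """Return (absolute, relative %) change of 1.1-mini versus 1-mini."""
    old, new = scores["1-mini"], scores["1.1-mini"]
    absolute = round(new - old, 2)
    relative = round(100 * (new - old) / old, 1)
    return absolute, relative


for name, scores in mer.items():
    absolute, relative = deltas(scores)
    print(f"{name}: {absolute:+.2f} points ({relative:+.1f}% relative)")
```

Negative values are improvements, since MER is an error rate.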

Training

Base: Qwen/Qwen3-ASR-0.6B (AuT encoder + Qwen3 decoder). A rank-16 LoRA on the decoder plus an encoder LoRA at a low learning rate were trained with latent-script targets and minimal Traditional injection, then merged (merge_and_unload) into a stock checkpoint with a decorated tokenizer (bit-exact decode verified on 152k+ sequences).
Data (< 10 h real speech total): NTU ML2021 lecture train split, Common Voice 19 zh-TW (validated_without_test), ASCEND train split, and TaiMECS (CC-BY-4.0; 20 human-recorded + 80 VoxCPM2 voice-cloned clips with dictionary-verified Taiwan readings). Targeted supplements were mined from the ML2021 train split for acronym/numeral conventions and pronoun usage.
No runtime post-processing: inference is audio → model.generate() → tokenizer.decode().
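
To illustrate why a merged LoRA checkpoint needs no adapter code at inference time, here is a toy, dependency-free sketch of the merge arithmetic, W' = W + (alpha / r) · B · A. It is not the training code for this model: the matrices, shapes, and the alpha and rank values are made up for the example, and the helper names matmul and merge_lora are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a LoRA update into the base weight: W' = W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    update = matmul(lora_b, lora_a)
    return [
        [w + scale * u for w, u in zip(w_row, u_row)]
        for w_row, u_row in zip(weight, update)
    ]


# Toy shapes: a 2x3 weight with a rank-1 adapter (the card reports rank 16).
weight = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
lora_a = [[0.5, -1.0, 0.0]]  # rank x d_in
lora_b = [[2.0], [4.0]]      # d_out x rank
alpha, rank = 2.0, 1         # hypothetical scaling values

merged = merge_lora(weight, lora_a, lora_b, alpha, rank)

# The merged weight gives the same output as base weight plus adapter path.
x = [[1.0], [2.0], [3.0]]  # one input column vector
adapter_path = matmul(lora_b, matmul(lora_a, x))
unmerged_out = [
    [base[0] + (alpha / rank) * extra[0]]
    for base, extra in zip(matmul(weight, x), adapter_path)
]
merged_out = matmul(merged, x)
print(merged_out == unmerged_out)
```

Because the update is folded into the weights, the resulting checkpoint has the same tensor shapes as the base model, which is consistent with the card's statement that it loads like stock Qwen3-ASR.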

Known limitations

On a diagnostic panel of 19 hard one-to-many Simplified→Traditional character classes (e.g. 干→乾/幹, 面→面/麵) drawn from adversarially mined contexts, 1.1-mini selects the non-default variant less accurately than TEA-ASR-1-mini. Its false-positive rate is 0% (it never over-applies rare variants), and common-context variant choice (lecture/CommonVoice domains) is unaffected. Closing this gap needs more hard-context speech and is the focus of the next release.
NTUML2021 and CommonVoice are marginally (≤ 0.11) behind TEA-ASR-1-mini; choose that model if pure Mandarin lecture transcription is the only workload.
English-only utterances within ASCEND remain harder for 1.1-mini (en subset: 25.9 MER) than for TEA-ASR-1-mini (24.4).
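
The one-to-many character classes in the first limitation are hard because no fixed per-character table can get them right. The sketch below shows this with a deliberately naive converter; the DEFAULT_VARIANT table, the choice of which variant counts as the default, and the four test words are assumptions for illustration, not the diagnostic panel used in this card.

```python
# Per-character defaults a naive Simplified-to-Traditional table might use.
# Illustrative only: the choice of default variant here is an assumption.
DEFAULT_VARIANT = {"干": "乾", "面": "面", "净": "淨", "条": "條", "对": "對"}


def naive_s2t(text: str) -> str:
    """Convert character by character, always picking the default variant."""
    return "".join(DEFAULT_VARIANT.get(ch, ch) for ch in text)


# (Simplified input, correct Traditional form) pairs.
cases = [
    ("干净", "乾淨"),  # "clean": the default variant 乾 is right
    ("干部", "幹部"),  # "cadre": needs the non-default variant 幹
    ("面对", "面對"),  # "to face": the default variant 面 is right
    ("面条", "麵條"),  # "noodles": needs the non-default variant 麵
]

for simplified, expected in cases:
    got = naive_s2t(simplified)
    status = "ok" if got == expected else f"wrong, expected {expected}"
    print(simplified, "->", got, status)
```

Picking the right variant requires the surrounding word or sentence, which is why the card treats non-default variant selection as a context problem rather than a lookup problem.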

Evaluation

All benchmarks were re-measured in one run on the full test sets (CommonVoice 19 zh-TW n=5013, ASCEND n=1315, CSZS zh-en n=3176, NTUML2021 n=2000) with the content-fair protocol above. Training excluded all evaluation splits (CommonVoice used validated_without_test; ML2021/ASCEND used their train splits; CSZS was never trained on).
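
A rough sketch of how a mixed error rate of this kind can be computed is shown below. It is not the evaluation script behind the table: the regular expression, the NFKC normalization, and the corpus-level aggregation (total edits over total reference tokens) are assumptions, and the script-folding step is left as a pluggable fold_script function where an OpenCC t2s converter would go.

```python
import re
import unicodedata
from typing import Callable, List

# CJK Unified Ideographs (basic block and Extension A); a simplification.
_CJK = r"\u3400-\u4dbf\u4e00-\u9fff"
_TOKEN = re.compile(rf"[{_CJK}]|[a-z0-9']+")


def tokenize(text: str, fold_script: Callable[[str], str] = lambda s: s) -> List[str]:
    """Fold script, lowercase, then split into CJK characters and English words.

    Anything the pattern does not match (punctuation, spaces) is dropped.
    """
    text = unicodedata.normalize("NFKC", fold_script(text)).lower()
    return _TOKEN.findall(text)


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Token-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def mer(refs, hyps, fold_script=lambda s: s) -> float:
    """Corpus-level MER%: total edits divided by total reference tokens."""
    errors = tokens = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens = tokenize(ref, fold_script)
        errors += edit_distance(ref_tokens, tokenize(hyp, fold_script))
        tokens += len(ref_tokens)
    return 100.0 * errors / tokens


refs = ["我們明天要 deploy 新的 model。"]
hyps = ["我們明天要deploy新的模型"]
print(tokenize(refs[0]))
print(round(mer(refs, hyps), 2))
```

In this example the reference has nine tokens (eight characters and "deploy" plus "model" counted as words), and the hypothesis replaces "model" with two characters, which costs one substitution and one insertion.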

Acknowledgements

Base model: Qwen3-ASR by Alibaba Cloud.
TaiMECS (CC-BY-4.0).
Benchmarks: Common Voice (Mozilla), ASCEND (CAiRE), CSZS, NTU ML2021 (Hung-yi Lee's course corpus).

Model details: Safetensors checkpoint, 0.8B parameters, BF16 tensors.
