TEA-ASR-1.1-mini · Taiwan Everyday Audio 🍵
TEA-ASR is an open, drop-in speech-recognition model purpose-built for Taiwan Mandarin. It turns real speech into natural Traditional Chinese with authentic Taiwan vocabulary, and it stays robust through the everyday Mandarin–English code-switching common in Taiwan. Adapted from the state-of-the-art Qwen3-ASR foundation and merged into a single self-contained checkpoint, TEA-ASR loads and runs exactly like stock Qwen3-ASR — no converters, no post-processing.
TEA-ASR-1.1-mini is the second-generation compact model (780M). Compared with
TEA-ASR-1-mini, this release substantially improves code-switching: the ASCEND and
CSZS error rates drop by 0.9 and 0.7 points absolute, while CommonVoice and lecture
recognition stay within about a tenth of a point (0.04 and 0.11). If your audio mixes
Mandarin and English (meetings, lectures, tech talk, everyday Taiwan speech), 1.1-mini
is the recommended model.
What's new in 1.1
- 🔀 Code-switch leap — ASCEND 12.49 → 11.59, CSZS 13.21 → 12.55. Trained with an English-span-preservation data balance so embedded English is transcribed, not translated.
- 🍵 TaiMECS in the loop — includes TaiMECS (Taiwan Mandarin–English code-switching sentences, CC-BY-4.0), with hand-verified Taiwan readings.
- 🔬 Error-analysis-driven data — targeted real-speech supplements (acronym conventions, inanimate-pronoun 它, numeral formatting) mined from public corpora fixed the exact regression buckets found by transcript-level error analysis.
- 🪶 Still <10 hours of training speech, LoRA-adapted and merged into a stock checkpoint.
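The deltas quoted above can be re-derived from the MER figures reported in this card. The snippet below is only an arithmetic check; the relative percentages are computed here and are not stated by the card itself.

```python
# (TEA-ASR-1-mini, TEA-ASR-1.1-mini) MER% pairs, copied from this card's benchmark figures.
scores = {
    "ASCEND": (12.49, 11.59),
    "CSZS": (13.21, 12.55),
    "CommonVoice 19": (5.14, 5.18),
    "NTUML2021": (7.37, 7.48),
}


def deltas(before, after):
    """Return (absolute change in MER points, relative change in percent)."""
    absolute = round(after - before, 2)
    relative = round(100.0 * (after - before) / before, 1)
    return absolute, relative


for name, (before, after) in scores.items():
    d_abs, d_rel = deltas(before, after)
    print(f"{name}: {d_abs:+.2f} points ({d_rel:+.1f}% relative)")
```

Negative values are improvements, since MER is an error rate.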
Quick start
```bash
pip install qwen-asr
```

```python
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained("JacobLinCool/TEA-ASR-1.1-mini")
result = model.transcribe(audio="utterance.wav", language="Chinese")[0]
print(result.text)  # -> Traditional Chinese with Taiwan lexicon
```
Set language="Chinese" for Taiwan speech (recommended). You can also pass a context= string of hotwords
(names, jargon) for contextual biasing, exactly as with the base Qwen3-ASR.
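As a rough sketch of the hotword option, the helpers below build a context string and pass it through `context=`. The names `build_context` and `transcribe_with_hotwords` are hypothetical, and the space-separated format of the context string is an assumption; this card only says that `context=` takes a string of hotwords.

```python
def build_context(hotwords):
    """Join hotwords into one string, dropping blanks and duplicates.

    The space separator is an assumption, not a documented format.
    """
    seen, kept = set(), []
    for word in hotwords:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            kept.append(word)
    return " ".join(kept)


def transcribe_with_hotwords(model, audio_path, hotwords):
    """Transcribe one file with contextual biasing and return the text."""
    results = model.transcribe(
        audio=audio_path,
        language="Chinese",
        context=build_context(hotwords),
    )
    return results[0].text
```

With the model loaded as in the quick start, a call might look like `transcribe_with_hotwords(model, "utterance.wav", ["LoRA", "Kubernetes"])`.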
Benchmark results
Mixed Error Rate (MER%, lower is better), all numbers from a single self-measured run under one protocol (content-fair fold: both sides OpenCC t2s, punctuation stripped, lowercase English, mixed CJK-char/EN-word tokenization — the protocol published by Breeze-ASR-25). Bold = this model.
| Benchmark | TEA-ASR-1.1-mini | TEA-ASR-1-mini | Qwen3-ASR-0.6B | Breeze-ASR-25 | Whisper-large-v3 |
|---|---|---|---|---|---|
| CommonVoice 19 (zh-TW) | **5.18** | 5.14 | 5.79 | 8.03 | 10.17 |
| ASCEND (zh-en) | **11.59** | 12.49 | 12.54 | 17.53 | 19.61 |
| CSZS (zh-en) | **12.55** | 13.21 | 16.03 | 12.18 | 23.24 |
| NTUML2021 | **7.48** | 7.37 | 11.03 | 7.50 | 9.68 |
How to read this. 1.1-mini leads every 0.6B-class system listed (TEA-ASR-1-mini, Qwen3-ASR-0.6B) on both code-switching suites (ASCEND, CSZS), although Breeze-ASR-25 remains ahead on CSZS (12.18 vs. 12.55). CommonVoice and NTUML2021 stay within 0.04 and 0.11 points of TEA-ASR-1-mini. The metric folds away script differences, so it does not reflect the decisive practical change versus the base model: TEA-ASR emits Traditional script and Taiwan vocabulary natively, whereas stock Qwen3-ASR produces Simplified script.
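For readers who want to approximate the scoring, here is a minimal sketch of a mixed error rate under the folding steps named above (script folding, punctuation stripping, lowercasing, CJK-character/English-word tokens). It is not the evaluation script behind the table. The CJK ranges, the NFKC normalization, the corpus-level aggregation, and the injectable `t2s` callable are all assumptions; in practice `t2s` could be the `convert` method of an OpenCC Traditional-to-Simplified converter.

```python
import re
import unicodedata

# One token per CJK ideograph, one token per English word or number.
# Anything else (punctuation, spaces) is dropped by findall.
_TOKEN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]|[a-z0-9]+(?:'[a-z]+)?")


def tokenize(text, t2s=None):
    """Fold script (optional), normalize width, lowercase, then tokenize."""
    if t2s is not None:
        text = t2s(text)  # e.g. a Traditional-to-Simplified converter
    text = unicodedata.normalize("NFKC", text).lower()
    return _TOKEN.findall(text)


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def mer(refs, hyps, t2s=None):
    """Corpus-level MER%: total edits divided by total reference tokens."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = tokenize(ref, t2s), tokenize(hyp, t2s)
        errors += edit_distance(r, h)
        total += len(r)
    if total == 0:
        raise ValueError("references contain no scorable tokens")
    return 100.0 * errors / total
```

Because both sides pass through the same `t2s` fold, a Traditional hypothesis and a Simplified hypothesis with the same content score identically, which is why the table cannot show the script difference described above.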
Training
- Base: Qwen/Qwen3-ASR-0.6B (AuT encoder + Qwen3 decoder). LoRA (r16) on the decoder plus a low-LR encoder LoRA, trained with latent-script targets and minimal Traditional injection, then merged (`merge_and_unload`) into a stock checkpoint with a decorated tokenizer (bit-exact decode verified on 152k+ sequences).
- Data (< 10 h real speech total): NTU ML2021 lecture train split, Common Voice 19 zh-TW (`validated_without_test`), ASCEND train split, and TaiMECS (CC-BY-4.0; 20 human-recorded + 80 VoxCPM2 voice-cloned clips with dictionary-verified Taiwan readings). Targeted supplements were mined from the ML2021 train split for acronym/numeral conventions and pronoun usage.
- No runtime post-processing: inference is `audio → model.generate() → tokenizer.decode()`.
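To illustrate why a merged LoRA checkpoint can be loaded like the stock model, the toy example below folds a low-rank update into a weight matrix using the standard LoRA identity W' = W + (alpha / r) · B · A. It uses plain Python lists and made-up numbers; it is not the training or merge code used for this model, and the alpha value is an assumption (this card states only the rank, r16).

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(w, x):
    """Multiply a matrix by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def merge_lora(w, a, b, alpha, r):
    """Fold the adapter into the base weight: W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def adapter_forward(w, a, b, alpha, r, x):
    """Unmerged forward pass: base output plus the scaled low-rank branch."""
    scale = alpha / r
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [y + scale * u for y, u in zip(base, update)]


# Toy shapes: W is 2x2, A is r x 2, B is 2 x r, with r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[0.5], [-1.0]]
x = [1.0, 1.0]

merged = merge_lora(W, A, B, alpha=2.0, r=1)
print(matvec(merged, x) == adapter_forward(W, A, B, 2.0, 1, x))
```

After the merge, only `merged` needs to be stored, so the checkpoint has the same shapes as the base model and needs no adapter at inference time.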
Known limitations
- On a diagnostic panel of 19 hard one-to-many Simplified→Traditional character classes (e.g. 干→乾/幹, 面→面/麵) drawn from adversarially mined contexts, 1.1-mini selects the non-default variant less accurately than TEA-ASR-1-mini. Its false-positive rate is 0% (it never over-applies rare variants), and common-context variant choice (lecture/CommonVoice domains) is unaffected. Closing this gap needs more hard-context speech and is the focus of the next release.
- NTUML2021 and CommonVoice are marginally (≤ 0.11) behind TEA-ASR-1-mini; choose that model if pure Mandarin lecture transcription is the only workload.
- English-only utterances within ASCEND are harder for 1.1-mini (en subset 25.9 MER) than for TEA-ASR-1-mini (24.4).
Evaluation
All benchmarks were re-measured in one run on the full test sets (CommonVoice 19 zh-TW n=5013, ASCEND n=1315,
CSZS zh-en n=3176, NTUML2021 n=2000) with the content-fair protocol above. Training excluded all evaluation
splits (CommonVoice used validated_without_test; ML2021/ASCEND used their train splits; CSZS was never trained
on).
Acknowledgements