STT-meta-ZH-100m
A dual-head Mandarin Chinese ASR model that simultaneously performs speech-to-text transcription and speaker attribute classification (age, gender, dialect) in a single forward pass.
Built on NVIDIA Citrinet-1024 with language-specific bottleneck adapters and a trailing tag classifier head, fine-tuned on 60 hours of meta-annotated Mandarin speech data using PromptingNemo.
| Metric | Value |
|---|---|
| Parameters | 157.7M |
| WER | 19.22% |
| Tag Accuracy | 94.2% |
| Language | Mandarin Chinese (zh) |
| Audio | 16kHz mono |
Architecture
Audio (16kHz) ──▶ Mel Spectrogram (80-dim) ──▶ Citrinet-1024 Encoder (23 blocks)
                                                                │
                                                      ┌─────────┴─────────┐
                                                      ▼                   ▼
                                                 CTC Decoder       Tag Classifier
                                                (5001 vocab)      (3 linear heads)
                                                      │                   │
                                                      ▼                   ▼
                                               Transcription +     AGE / GENDER /
                                                 Entity Tags       DIALECT labels
Parameter Breakdown
| Component | Parameters | Description |
|---|---|---|
| Citrinet-1024 Encoder | 140.4M | 23 Jasper-style blocks with squeeze-excitation |
| Language Adapter | 12.1M | Bottleneck adapters (dim=256) in each encoder block |
| CTC Decoder | 5.1M | Conv1d projecting 1024 → 5001 (BPE vocab + blank) |
| Tag Classifier | 12.3K | 3 linear heads on mean-pooled encoder output |
| Total | 157.7M | |
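The 12.3K figure for the tag classifier is consistent with one plain linear layer per attribute category (1024 inputs each, with the class counts listed under Tag Categories below). Here is a minimal PyTorch sketch of such a head; the class name TrailingTagClassifier comes from PromptingNemo, but this exact structure is an assumption, not a dump of the actual code:

import torch.nn as nn

class TrailingTagClassifierSketch(nn.Module):
    """One linear head per attribute, applied to mean-pooled encoder output."""
    def __init__(self, enc_dim=1024, n_age=5, n_gender=3, n_dialect=4):
        super().__init__()
        self.age = nn.Linear(enc_dim, n_age)          # 1024*5 + 5 = 5,125 params
        self.gender = nn.Linear(enc_dim, n_gender)    # 1024*3 + 3 = 3,075 params
        self.dialect = nn.Linear(enc_dim, n_dialect)  # 1024*4 + 4 = 4,100 params
                                                      # total 12,300 ≈ 12.3K

    def forward(self, pooled):  # pooled: [batch, enc_dim]
        return self.age(pooled), self.gender(pooled), self.dialect(pooled)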
Tag Categories
| Category | Classes | Labels |
|---|---|---|
| AGE | 5 | NONE, AGE_14_25, AGE_26_40, AGE_<14, AGE_>41 |
| GENDER | 3 | NONE, GENDER_FEMALE, GENDER_MALE |
| DIALECT | 4 | NONE, DIALECT_NORTH, DIALECT_OTHERS, DIALECT_SOUTH |
The CTC head also outputs inline entity tags (e.g., ENTITY_PERSON_NAME ... END, ENTITY_TEMPERATURE ... END) as part of the transcription vocabulary.
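Because these tags are plain tokens in the transcript, entity spans can be recovered with simple string processing. A hedged sketch (the regex is illustrative, not part of the model's tooling):

import re

def extract_entities(transcript):
    """Return (entity_type, text) pairs from inline ENTITY_* ... END markers."""
    return [
        (m.group(1), m.group(2).strip())
        for m in re.finditer(r"ENTITY_([A-Z_]+)\s+(.*?)\s+END", transcript)
    ]

# extract_entities("你好 ENTITY_PERSON_NAME 张三 END") -> [("PERSON_NAME", "张三")]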
Files
| File | Description |
|---|---|
| zh-citrinet-meta-v11.nemo | Full NeMo checkpoint (encoder + decoder + adapter + tag classifier) |
| onnx/model.onnx | ONNX model with dual outputs: logprobs (CTC) + encoder_output |
| onnx/tag_classifier.onnx | Standalone tag classifier (input: pooled encoder features) |
| onnx/tag_classifier.json | Tag classifier metadata (labels, class counts) |
| onnx/config.json | Preprocessor configuration (mel spectrogram parameters) |
| onnx/tokenizer.model | SentencePiece BPE tokenizer (5,000 tokens) |
| onnx/vocabulary.json | Full vocabulary list with token mappings |
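All of the above can be fetched in one call with huggingface_hub's snapshot_download (a convenience; the scripts below download the files they need individually):

from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="WhissleAI/STT-meta-ZH-100m")
# local_dir now contains zh-citrinet-meta-v11.nemo and the onnx/ directory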
Usage
NeMo Inference
import nemo.collections.asr as nemo_asr
# Standard NeMo transcription (CTC head only — tag classifier weights
# are stored in the checkpoint but EncDecCTCModelBPE does not load them
# by default). For full dual-head inference, use ONNX or PromptingNemo.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    "WhissleAI/STT-meta-ZH-100m"
)
transcriptions = asr_model.transcribe(["audio.wav"])
print(transcriptions[0])
# Output includes inline tags:
# "你好世界。 AGE_26_40 GENDER_MALE ENTITY_PERSON_NAME 张三 END"
PromptingNemo Inference (Full Dual-Head)
For full dual-head inference with the tag classifier, use the PromptingNemo training framework:
# Clone PromptingNemo
# git clone https://github.com/WhissleAI/PromptingNemo.git
import torch
from huggingface_hub import hf_hub_download
# Download the .nemo checkpoint
nemo_path = hf_hub_download(
    repo_id="WhissleAI/STT-meta-ZH-100m",
    filename="zh-citrinet-meta-v11.nemo",
)
# Load with PromptingNemo's custom model class that includes the tag classifier
# See: https://github.com/WhissleAI/PromptingNemo/blob/main/scripts/asr/meta-asr
from scripts.asr.meta_asr.tag_classifier import (
    TrailingTagClassifier,
    build_trailing_tag_maps,
    masked_mean_pool,
)
# The tag_classifier weights are stored inside the .nemo archive.
# PromptingNemo's training script loads them automatically.
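For reference, here is a minimal sketch of what a masked mean pooling step computes (assumed behavior, mirroring the numpy version in the ONNX section below; the real masked_mean_pool lives in PromptingNemo):

import torch

def masked_mean_pool_sketch(enc, lengths):
    """enc: [B, T, D] encoder frames; lengths: [B] valid frame counts."""
    t = torch.arange(enc.shape[1], device=enc.device)
    mask = (t[None, :] < lengths[:, None]).unsqueeze(-1).float()  # [B, T, 1]
    return (enc * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)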
ONNX Inference (Production — Recommended)
Self-contained inference using only onnxruntime, numpy, soundfile, and sentencepiece:
import json
import numpy as np
import onnxruntime as ort
import soundfile as sf
import sentencepiece as spm
from huggingface_hub import hf_hub_download
# Download model files
repo = "WhissleAI/STT-meta-ZH-100m"
model_path = hf_hub_download(repo, "onnx/model.onnx")
cls_path = hf_hub_download(repo, "onnx/tag_classifier.onnx")
cls_meta_path = hf_hub_download(repo, "onnx/tag_classifier.json")
tok_path = hf_hub_download(repo, "onnx/tokenizer.model")
vocab_path = hf_hub_download(repo, "onnx/vocabulary.json")
config_path = hf_hub_download(repo, "onnx/config.json")
# Load config and vocabulary
with open(config_path) as f:
    config = json.load(f)
with open(vocab_path) as f:
    vocab_data = json.load(f)
with open(cls_meta_path) as f:
    cls_meta = json.load(f)
vocabulary = vocab_data["vocabulary"]
blank_id = vocab_data.get("blank_id", len(vocabulary))
# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.Load(tok_path)
# Load ONNX sessions
asr_session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
cls_session = ort.InferenceSession(cls_path, providers=["CPUExecutionProvider"])
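# Optional sanity check (an addition, not part of the original script):
# onnxruntime sessions expose their I/O signatures, which confirms the
# dual-output graph described in the Files table.
for node in asr_session.get_inputs():
    print("input:", node.name, node.shape)
for node in asr_session.get_outputs():
    print("output:", node.name, node.shape)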
# --- Preprocessing ---
def preprocess_audio(audio_path, config):
    """Convert audio to log-mel spectrogram features."""
    audio, sr = sf.read(audio_path, dtype="float32")
    if sr != 16000:
        raise ValueError(f"Expected 16kHz audio, got {sr}Hz")
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    # Preemphasis
    preemph = config["preprocessor"]["preemph"]
    audio = np.concatenate([[audio[0]], audio[1:] - preemph * audio[:-1]])
    # STFT
    n_fft = config["preprocessor"]["n_fft"]
    hop = config["preprocessor"]["hop_length"]
    win = config["preprocessor"]["win_length"]
    window = np.hanning(win + 1)[:-1].astype(np.float32)
    # Pad audio
    pad_len = (n_fft - hop) // 2
    audio = np.pad(audio, (pad_len, pad_len), mode="reflect")
    frames = []
    for start in range(0, len(audio) - n_fft + 1, hop):
        frame = audio[start : start + n_fft] * np.pad(window, (0, n_fft - win))
        frames.append(np.fft.rfft(frame))
    spec = np.abs(np.array(frames, dtype=np.complex64)) ** 2
    # Mel filterbank
    n_mels = config["preprocessor"]["features"]
    fmin = config["preprocessor"]["lowfreq"]
    fmax = sr / 2 if config["preprocessor"]["highfreq"] is None else config["preprocessor"]["highfreq"]
    mel_points = np.linspace(
        2595 * np.log10(1 + fmin / 700),
        2595 * np.log10(1 + fmax / 700),
        n_mels + 2,
    )
    hz_points = 700 * (10 ** (mel_points / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        for j in range(bins[i], bins[i + 1]):
            fbank[i, j] = (j - bins[i]) / max(bins[i + 1] - bins[i], 1)
        for j in range(bins[i + 1], bins[i + 2]):
            fbank[i, j] = (bins[i + 2] - j) / max(bins[i + 2] - bins[i + 1], 1)
    mel_spec = spec @ fbank.T
    log_mel = np.log(mel_spec + config["preprocessor"]["log_zero_guard_value"])
    # Per-feature normalization
    mean = log_mel.mean(axis=0, keepdims=True)
    std = log_mel.std(axis=0, keepdims=True)
    log_mel = (log_mel - mean) / (std + 1e-5)
    # Pad to multiple of 16
    pad_to = config["preprocessor"].get("pad_to", 16)
    T = log_mel.shape[0]
    if T % pad_to != 0:
        pad_frames = pad_to - (T % pad_to)
        log_mel = np.pad(log_mel, ((0, pad_frames), (0, 0)))
    # Shape: [1, features, time]
    features = log_mel.T[np.newaxis, :, :].astype(np.float32)
    return features, T
# --- Inference ---
features, valid_len = preprocess_audio("audio.wav", config)
length = np.array([features.shape[2]], dtype=np.int64)
# Run ASR model (dual output)
logprobs, encoder_output = asr_session.run(
    ["logprobs", "encoder_output"],
    {"audio_signal": features, "length": length},
)
# Greedy CTC decode
pred_ids = np.argmax(logprobs[0], axis=-1)
# Collapse repeats and remove blanks
decoded_ids = []
prev = -1
for idx in pred_ids:
    if idx != prev and idx != blank_id:
        decoded_ids.append(int(idx))
    prev = idx
transcript = sp.DecodeIds(decoded_ids)
print(f"Transcript: {transcript}")
# --- Tag Classification ---
# encoder_output shape: [1, 1024, T] -> transpose to [1, T, 1024]
enc = encoder_output.transpose(0, 2, 1)
# Masked mean pooling
mask = np.zeros((1, enc.shape[1], 1), dtype=np.float32)
mask[0, :valid_len // 8, :] = 1.0 # Citrinet has 8x downsampling
pooled = (enc * mask).sum(axis=1) / mask.sum(axis=1).clip(min=1)
# Run tag classifier
tag_outputs = cls_session.run(None, {"pooled_encoder": pooled.astype(np.float32)})
categories = cls_meta["categories"]
for idx, (cat_name, cat_info) in enumerate(sorted(categories.items())):
    pred = int(np.argmax(tag_outputs[idx][0]))
    label = cat_info["labels"][pred]
    print(f" {cat_name}: {label}")
Example output:
Transcript: 来首歌吻别。 AGE_14_25 GENDER_FEMALE
AGE: AGE_14_25
DIALECT: DIALECT_SOUTH
GENDER: GENDER_FEMALE
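The classifier heads return raw logits; to report a confidence alongside each label, apply a softmax (an addition to the script above, assuming the same sorted category ordering, not something the model provides):

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

for cat_name, out in zip(sorted(categories.keys()), tag_outputs):
    probs = softmax(out[0])
    print(f" {cat_name}: {probs.max():.1%} confident")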
Training Details
| Setting | Value |
|---|---|
| Base Model | stt_zh_citrinet_1024_gamma_0_25.nemo (NVIDIA) |
| Framework | NeMo + PromptingNemo |
| Training Data | 60,098 samples / 60 hours (AISHELL-3 with meta-tags) |
| Test Data | 24,772 samples / 22.4 hours |
| Optimizer | Adam (lr=5e-4, weight_decay=0, warmup=2000 steps) |
| LR Schedule | CosineAnnealing (min_lr=1e-6) |
| Batch Size | 16 (effective 32 with grad accumulation 2) |
| Max Duration | 16s |
| Mixed Precision | FP16 |
| Spec Augment | 4 time masks, width 80 |
| Adapter | Bottleneck (dim=256, activation=swish, norm=pre) |
| Tag Classifier Weight | 0.1 (auxiliary loss) |
| Hardware | 1x NVIDIA T4 16GB |
| Training Steps | 18,000+ (best at step 17,005) |
| Tokenizer | SentencePiece BPE (5,000 tokens) |
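For orientation, here is a sketch of how a bottleneck adapter with these settings can be attached via NeMo's adapter API (the adapter name and exact training setup are assumptions; see PromptingNemo for the actual scripts):

from nemo.collections.common.parts.adapter_modules import LinearAdapterConfig

adapter_cfg = LinearAdapterConfig(
    in_features=1024,      # Citrinet-1024 encoder width
    dim=256,               # bottleneck dimension from the table
    activation="swish",
    norm_position="pre",
)
asr_model.add_adapter(name="zh_meta", cfg=adapter_cfg)  # hypothetical adapter name
asr_model.set_enabled_adapters(name="zh_meta", enabled=True)
asr_model.freeze()                     # keep the base encoder frozen
asr_model.unfreeze_enabled_adapters()  # train only the adapter weights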
What Makes This Model Different
Unlike standard ASR models, this model:
- Outputs structured metadata — AGE, GENDER, and DIALECT predictions via a separate classification head on the encoder output, without affecting CTC alignment
- Inline entity recognition — Named entities (PERSON_NAME, TEMPERATURE, DATE, etc.) are tagged directly in the transcript using ENTITY_TYPE ... END markers
- Adapter-based fine-tuning — Only the bottleneck adapters (12.1M params) and tag classifier (12.3K params) are trained; the base Citrinet encoder is frozen
- ONNX-ready — Dual-output ONNX graph exposes both CTC logprobs and raw encoder features for the tag classifier
Evaluation Results
Transcription Quality (CER)
Evaluated on 500 samples from the AISHELL-3 test set, with all meta-tags stripped for fair character-level comparison:
| Model | Params | CER | Additional Outputs |
|---|---|---|---|
| nvidia/stt_zh_citrinet_1024 | 140M | 3.19% | Transcription only |
| WhissleAI/STT-meta-ZH-100m | 157.7M | 11.31% | + AGE, GENDER, DIALECT, Entities |
The meta model trades roughly 8 percentage points of CER for rich per-utterance metadata. The CTC head must learn to output both transcription tokens and inline entity tags (e.g., ENTITY_PERSON_NAME ... END), which reduces pure transcription accuracy relative to the transcription-only base model.
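A hedged sketch of reproducing such a comparison, assuming the jiwer library for scoring and a tag-stripping step like the helper sketched in the NeMo Inference section:

import re
import jiwer

def clean(text):
    text = re.sub(r"\bENTITY_[A-Z_]+\b|\bEND\b", "", text)
    text = re.sub(r"\b(?:AGE|GENDER|DIALECT)_[A-Z0-9_<>]+", "", text)
    return re.sub(r"\s+", "", text)  # drop whitespace for character-level scoring

ref = "来首歌吻别。"                            # ground-truth transcript
hyp = "来首歌吻别。 AGE_14_25 GENDER_FEMALE"   # model output with meta-tags
print(f"CER: {jiwer.cer(clean(ref), clean(hyp)):.2%}")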
Tag Classification Accuracy
| Category | Accuracy |
|---|---|
| Overall tags | 94.2% |
Meta-ASR WER (including tags)
| Split | WER (with tags) |
|---|---|
| Test | 19.22% |
Limitations
- Trained primarily on AISHELL-3 data — may not generalize well to spontaneous/noisy Mandarin speech
- Limited dialect diversity (North/South/Others) — does not cover specific regional varieties
- Age classification uses broad buckets (<14, 14-25, 26-40, >41)
- Entity recognition is limited to entity types seen in training data
Citation
@misc{whissle2025sttmetazh,
  title={STT-meta-ZH-100m: Dual-Head Mandarin ASR with Speaker Attribute Classification},
  author={WhissleAI},
  year={2025},
  url={https://huggingface.co/WhissleAI/STT-meta-ZH-100m}
}
License
Apache 2.0