CLD: Low-Resource Convex Language-Detection Heads (whisper-small, en/zh)
Convex Low-resource Accent-Robust Language Detection (CLD) heads for the
binary English/Chinese spoken language detection task, trained on frozen
openai/whisper-small encoder embeddings. This repo contains the
low-resource sweep: the same convex ReLU-MLP head trained at four training-set
sizes (100, 500, 1000, and 10000 samples per class) to study data efficiency.
Model description
Each artifact is a two-layer convex ReLU MLP spoken language detection head trained
on mean-pooled frozen openai/whisper-small encoder embeddings (dim 768). It is trained by solving
a convex reformulation with CRONOS (ADMM in JAX) rather than standard NN training.
At inference it predicts the spoken language (en/zh) and selects the matching Whisper
language token before decoding. Each is a CVX_ReLU_MLP with theta1 (2, 768, 128),
theta2 (2, 128), n_classes = 2. Label order is the sorted ISO-639-1 codes:
0 → en, 1 → zh.
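The two input conventions described above (mean-pooling over encoder frames and the sorted label order) can be sketched with plain NumPy. This is an illustration only, not the jaxcld implementation: `mean_pool`, `LANGUAGES`, and `LABEL_TO_INDEX` are hypothetical names, and the random array stands in for real encoder output (1500 frames is what the Whisper encoder emits for a 30-second window).

```python
import numpy as np

# Label order is the sorted ISO-639-1 codes, so "en" gets index 0 and "zh" index 1.
LANGUAGES = sorted(["zh", "en"])
LABEL_TO_INDEX = {lang: i for i, lang in enumerate(LANGUAGES)}


def mean_pool(encoder_states: np.ndarray) -> np.ndarray:
    """Average frame-level encoder states (frames, dim) into one (dim,) vector."""
    return encoder_states.mean(axis=0)


# Stand-in for real encoder output: random values with whisper-small's hidden size (768).
rng = np.random.default_rng(0)
fake_states = rng.standard_normal((1500, 768)).astype(np.float32)

pooled = mean_pool(fake_states)
print(pooled.shape, LABEL_TO_INDEX)
```

The pooled vector has a fixed size regardless of clip length, which is what lets a small head with a fixed 768-dimensional input sit on top of the frozen encoder.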
How to use
Loading the head requires JAX (the weights are JAX arrays):
pip install jaxcld jax
import numpy as np
from huggingface_hub import hf_hub_download
from jaxcld import ASRModel, CVXNNLangDetectHead
languages = ["en", "zh"]
config = "1000" # one of: 100, 500, 1000, 10000
asr = ASRModel.from_pretrained("openai/whisper-small", config={"languages": languages})
head_path = hf_hub_download("williamhtan/cld-whisper-small-enzh", f"{config}/model.pkl")
head = CVXNNLangDetectHead.load(head_path, asr)
asr.set_lang_detect_head(head)
audio_16k_mono: np.ndarray = ... # shape (T,), 16 kHz mono
pred_langs, pred_texts = asr.predict(audio_16k_mono)
print(pred_langs[0], pred_texts[0])
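The snippet above expects a one-dimensional 16 kHz mono array. If your audio is stereo or at another sample rate, a rough conversion might look like the sketch below. `to_16k_mono` is a hypothetical helper, not part of jaxcld; it assumes a `(samples, channels)` layout for multi-channel input and uses plain linear interpolation with no anti-aliasing filter, so a dedicated resampler is preferable for real evaluation. Casting to float32 is also an assumption about what the pipeline accepts.

```python
import numpy as np


def to_16k_mono(audio: np.ndarray, sample_rate: int, target_rate: int = 16_000) -> np.ndarray:
    """Downmix to mono and linearly resample to target_rate (crude, no anti-aliasing)."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # Assumed layout: (samples, channels). Average the channels.
        audio = audio.mean(axis=1)
    if sample_rate == target_rate:
        return audio
    n_out = int(round(audio.shape[0] * target_rate / sample_rate))
    old_times = np.arange(audio.shape[0]) / sample_rate
    new_times = np.arange(n_out) / target_rate
    return np.interp(new_times, old_times, audio).astype(np.float32)


# One second of silent 44.1 kHz stereo as a stand-in for a real recording.
stereo_44k = np.zeros((44_100, 2), dtype=np.float32)
audio_16k_mono = to_16k_mono(stereo_44k, sample_rate=44_100)
print(audio_16k_mono.shape, audio_16k_mono.dtype)
```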
Contents
One convex head per training-set size, under a per-config subfolder:
williamhtan/cld-whisper-small-enzh
├── 100/model.pkl
├── 500/model.pkl
├── 1000/model.pkl
└── 10000/model.pkl
Results
Detection accuracy / WER / CER from benchmark_cld.py on the held-out test split for
these exact artifacts. Validation peak = best test-split accuracy during CRONOS
training. Training time and TFLOPs from the training run.
| Samples/class | Det. acc | WER (↓) | CER (↓) | Val peak | data_seed | Train time (s) | TFLOPs |
|---|---|---|---|---|---|---|---|
| 100 | 1.00 (n=20) | 42.11 | 38.17 | 1.000 | 3 | 36.1 | 8,340 |
| 500 | 0.99 (n=100) | 28.03 | 36.30 | 1.000 | 1 | 50.3 | 11,623 |
| 1000 | 0.99 (n=200) | 31.68 | 31.22 | 0.985 | 6 | 64.4 | 14,871 |
| 10000 | 0.99 (n=1860) | 27.67 | 28.47 | 0.989 | 6 | 310.1 | 71,642 |
The convex head stays at roughly 0.99 to 1.00 detection accuracy across all sample sizes, including the 100-sample regime. The 100-sample test split is tiny (n=20), so its WER/CER estimates carry higher variance than those of the larger configs.
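To get a feel for how much the small test split limits what the detection accuracy can tell us, one option is a Wilson score interval on each reported accuracy. This is a sketch that is not part of benchmark_cld.py: `wilson_interval` is a hypothetical helper, the inputs are the rounded accuracies and test-set sizes from the table, and the resulting bounds are therefore approximate.

```python
import math


def wilson_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion observed on n samples."""
    denom = 1 + z * z / n
    center = (accuracy + z * z / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# (samples per class, reported detection accuracy, test-set size) from the table.
rows = [(100, 1.00, 20), (500, 0.99, 100), (1000, 0.99, 200), (10000, 0.99, 1860)]
for samples, acc, n in rows:
    low, high = wilson_interval(acc, n)
    print(f"{samples:>5} samples/class: acc={acc:.2f}, n={n}, 95% CI ~ [{low:.3f}, {high:.3f}]")
```

For the 100-sample config, a perfect score on n=20 still leaves a lower bound of roughly 0.84, so it should not be read as evidence that the smallest head outperforms the larger ones.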
Training
Trained with train_cvxnn.py (CRONOS / ADMM in JAX). Shared hyperparameters:
rank=20, neuron=64, beta=0.001, rho=0.1, gamma_ratio=1, admm_iters=6, pcg_iters=32, opt_seed=1024 (per-config data_seed in the table above). Inputs are mean-pooled
frozen openai/whisper-small encoder embeddings (dim 768).
Citation
@inproceedings{feng2026cld,
title = {Convex Low-resource Accent-Robust Language Detection in Speech Recognition},
author = {Feng, Miria and Tan, William and Pilanci, Mert},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
year = {2026},
series = {Proceedings of Machine Learning Research},
publisher = {PMLR},
url = {https://arxiv.org/abs/2605.23235}
}
License
MIT; see the CLD repository.