CLD — Low-Resource Convex Language-Detection Heads (whisper-small, en/zh)

Convex Low-resource Accent-Robust Language Detection (CLD) heads for the binary English/Chinese spoken language detection task, trained on frozen openai/whisper-small encoder embeddings. This repo contains the low-resource sweep: the same convex ReLU-MLP head trained at four training-set sizes — 100, 500, 1000, 10000 samples per class — to study data efficiency.

Model description

Each artifact is a two-layer convex ReLU MLP spoken language detection head trained on mean-pooled frozen openai/whisper-small encoder embeddings (dim 768). It is trained by solving a convex reformulation with CRONOS (ADMM in JAX) rather than standard NN training. At inference it predicts the spoken language (en/zh) and selects the matching Whisper language token before decoding. Each artifact is a CVX_ReLU_MLP with parameters theta1 of shape (2, 768, 128) and theta2 of shape (2, 128), and n_classes = 2. Label order follows the sorted ISO-639-1 codes: 0→en, 1→zh.
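The pooling and label-mapping steps described above can be illustrated with a small pure-Python sketch. It is not the jaxcld implementation: the function names are hypothetical, pooling is assumed to average over the time axis, and the scores stand in for whatever the head outputs. The only external fact it relies on is that Whisper language tokens have the form `<|en|>` and `<|zh|>`.

```python
def mean_pool(frames):
    """Average a (T, D) list of frame embeddings into one length-D vector.

    Assumption: pooling is a plain mean over the time axis.
    """
    if not frames:
        raise ValueError("need at least one frame")
    return [sum(column) / len(frames) for column in zip(*frames)]


def whisper_language_token(code):
    """Whisper language tokens wrap the language code, e.g. <|en|>."""
    return f"<|{code}|>"


def select_language(scores, languages):
    """Map the highest-scoring class index to a language code and token.

    Label order is the sorted language codes, so index 0 is "en" and
    index 1 is "zh" regardless of the order the codes were passed in.
    """
    labels = sorted(languages)
    if len(scores) != len(labels):
        raise ValueError("expected one score per language")
    best = max(range(len(labels)), key=lambda i: scores[i])
    return labels[best], whisper_language_token(labels[best])


# Toy data: 2 frames of dimension 3 (real embeddings have dimension 768).
frames = [[0.0, 2.0, 4.0], [2.0, 4.0, 6.0]]
pooled = mean_pool(frames)

# Hypothetical per-class scores from a detection head.
language, token = select_language([0.2, 0.8], ["zh", "en"])
print(pooled, language, token)
```

Sorting the codes inside `select_language` keeps the index-to-language mapping stable even if a caller lists the languages in a different order.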

How to use

Loading the head requires JAX (the weights are JAX arrays):

pip install jaxcld jax

import numpy as np
from huggingface_hub import hf_hub_download
from jaxcld import ASRModel, CVXNNLangDetectHead

# Languages the head distinguishes; labels follow the sorted codes (0→en, 1→zh).
languages = ["en", "zh"]
# Training-set size (samples per class) of the head to load.
config = "1000"   # one of: 100, 500, 1000, 10000

# Load the frozen Whisper model, then download and load the matching head.
asr = ASRModel.from_pretrained("openai/whisper-small", config={"languages": languages})
head_path = hf_hub_download("williamhtan/cld-whisper-small-enzh", f"{config}/model.pkl")
head = CVXNNLangDetectHead.load(head_path, asr)

# Attach the head so it selects the Whisper language token before decoding.
asr.set_lang_detect_head(head)
audio_16k_mono: np.ndarray = ...   # shape (T,), 16 kHz mono
pred_langs, pred_texts = asr.predict(audio_16k_mono)
print(pred_langs[0], pred_texts[0])

The repo stores one convex head per training-set size, each under its own per-config subfolder:

williamhtan/cld-whisper-small-enzh
├── 100/model.pkl
├── 500/model.pkl
├── 1000/model.pkl
└── 10000/model.pkl
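Given the layout above, the file to download is determined by the samples-per-class setting alone. The following sketch builds and validates that path; `head_filename` and the two constants are illustrative names, not part of jaxcld or huggingface_hub.

```python
REPO_ID = "williamhtan/cld-whisper-small-enzh"
CONFIGS = ("100", "500", "1000", "10000")  # samples per class


def head_filename(samples_per_class):
    """Return the in-repo path of the head trained at the given size.

    Accepts an int or a string and rejects sizes that were not part of
    the sweep, so a typo fails before any download is attempted.
    """
    config = str(samples_per_class)
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    return f"{config}/model.pkl"


for size in CONFIGS:
    print(REPO_ID, head_filename(size))
```

The returned string can be passed as the filename argument of `hf_hub_download(REPO_ID, head_filename(500))`.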

Results

Detection accuracy, WER, and CER come from benchmark_cld.py, run on the held-out test split with these exact artifacts; n in the detection-accuracy column is the size of that test split. Validation peak is the best test-split accuracy observed during CRONOS training. Training time and TFLOPs are taken from the training run.

Samples/class	Det. acc	WER (↓)	CER (↓)	Val peak	data_seed	Train time (s)	TFLOPs
100	1.00 (n=20)	42.11	38.17	1.000	3	36.1	8,340
500	0.99 (n=100)	28.03	36.30	1.000	1	50.3	11,623
1000	0.99 (n=200)	31.68	31.22	0.985	6	64.4	14,871
10000	0.99 (n=1860)	27.67	28.47	0.989	6	310.1	71,642

The convex head stays at ~0.99–1.00 detection accuracy across all sample sizes, including the 100-sample regime. The 100-sample test split is tiny (n=20), so its WER/CER are higher-variance.
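To give a rough sense of how much the test-split size matters, the sketch below computes a 95% Wilson score interval for each reported detection accuracy. This is an illustration, not part of benchmark_cld.py: it uses the rounded accuracies and n values from the table, since the exact counts of correct predictions are not given here.

```python
import math


def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    centre = accuracy + z * z / (2 * n)
    spread = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    lower = max(0.0, (centre - spread) / denom)
    upper = min(1.0, (centre + spread) / denom)
    return lower, upper


# (reported detection accuracy, test-split size) per samples-per-class config.
REPORTED = {100: (1.00, 20), 500: (0.99, 100), 1000: (0.99, 200), 10000: (0.99, 1860)}

for size, (acc, n) in REPORTED.items():
    lower, upper = wilson_interval(acc, n)
    print(f"{size:>5} samples/class: acc={acc:.2f}, n={n}, 95% CI=({lower:.3f}, {upper:.3f})")
```

For the 100-sample config, a perfect score on n=20 still leaves a lower bound near 0.84, which is far wider than the interval for the n=1860 split.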

Training

Trained with train_cvxnn.py (CRONOS / ADMM in JAX). Shared hyperparameters: rank=20, neuron=64, beta=0.001, rho=0.1, gamma_ratio=1, admm_iters=6, pcg_iters=32, opt_seed=1024 (per-config data_seed in the table above). Inputs are mean-pooled frozen openai/whisper-small encoder embeddings (dim 768).
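The shared and per-config settings above can be collected in one place, for example when logging or reproducing a run. The sketch below only records the values listed in this card; how train_cvxnn.py actually accepts them (flags, config file, or function arguments) is not shown here, so the class and function names are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SharedHyperparams:
    """Hyperparameters shared by all four heads, as listed in this card."""
    rank: int = 20
    neuron: int = 64
    beta: float = 0.001
    rho: float = 0.1
    gamma_ratio: float = 1.0
    admm_iters: int = 6
    pcg_iters: int = 32
    opt_seed: int = 1024


# Per-config data_seed values from the results table.
DATA_SEEDS = {100: 3, 500: 1, 1000: 6, 10000: 6}


def run_config(samples_per_class):
    """Merge the shared settings with the data seed for one sweep point."""
    if samples_per_class not in DATA_SEEDS:
        raise ValueError(f"no run recorded for {samples_per_class} samples per class")
    config = asdict(SharedHyperparams())
    config["samples_per_class"] = samples_per_class
    config["data_seed"] = DATA_SEEDS[samples_per_class]
    return config


print(run_config(1000))
```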

Citation

@inproceedings{feng2026cld,
  title     = {Convex Low-resource Accent-Robust Language Detection in Speech Recognition},
  author    = {Feng, Miria and Tan, William and Pilanci, Mert},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR},
  url       = {https://arxiv.org/abs/2605.23235}
}

License

MIT — see the CLD repository.
