CLD: Low-Resource Convex Language-Detection Heads (whisper-small, en/zh)

Convex Low-resource Accent-Robust Language Detection (CLD) heads for the binary English/Chinese spoken language detection task, trained on frozen openai/whisper-small encoder embeddings. This repo contains the low-resource sweep: the same convex ReLU-MLP head trained at four training-set sizes (100, 500, 1000, and 10000 samples per class) to study data efficiency.


Model description

Each artifact is a two-layer convex ReLU MLP spoken language detection head trained on mean-pooled frozen openai/whisper-small encoder embeddings (dim 768). It is trained by solving a convex reformulation with CRONOS (ADMM in JAX) rather than by standard NN training. At inference it predicts the spoken language (en/zh) and selects the matching Whisper language token before decoding. Each head is a CVX_ReLU_MLP with theta1 of shape (2, 768, 128), theta2 of shape (2, 128), and n_classes = 2. Label order follows the sorted ISO-639-1 codes: 0→en, 1→zh.
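The input and label conventions described above can be sketched in NumPy. This is a minimal illustration, not the jaxcld implementation: `mean_pool` and `head_logits` are hypothetical helper names, the weights are random stand-ins with the stated shapes, and the forward pass (one ReLU block per class) is only one plausible reading of the theta1/theta2 shapes. The actual CVX_ReLU_MLP may compute its output differently.

```python
import numpy as np

# Label order stated in the card: sorted ISO-639-1 codes, so 0 -> en, 1 -> zh.
LABELS = sorted(["zh", "en"])

def mean_pool(frames: np.ndarray) -> np.ndarray:
    """Average encoder frames of shape (T, 768) into a single (768,) vector."""
    return frames.mean(axis=0)

def head_logits(x: np.ndarray, theta1: np.ndarray, theta2: np.ndarray) -> np.ndarray:
    """ASSUMED forward pass: an independent ReLU block per class.

    theta1 has shape (n_classes, d, m) and theta2 has shape (n_classes, m).
    """
    hidden = np.maximum(np.einsum("d,cdm->cm", x, theta1), 0.0)  # (n_classes, m)
    return np.sum(hidden * theta2, axis=1)                       # (n_classes,)

rng = np.random.default_rng(0)
frames = rng.standard_normal((1500, 768)).astype(np.float32)    # stand-in for encoder output
theta1 = rng.standard_normal((2, 768, 128)).astype(np.float32)  # random, NOT trained weights
theta2 = rng.standard_normal((2, 128)).astype(np.float32)       # random, NOT trained weights

x = mean_pool(frames)
logits = head_logits(x, theta1, theta2)
pred = LABELS[int(np.argmax(logits))]
print(x.shape, logits.shape, pred)
```

With random weights the predicted label is meaningless; the sketch only shows how the shapes and the label index fit together.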

How to use

Loading the head requires JAX (the weights are JAX arrays):

pip install jaxcld jax

import numpy as np
from huggingface_hub import hf_hub_download
from jaxcld import ASRModel, CVXNNLangDetectHead

languages = ["en", "zh"]
config = "1000"   # one of: 100, 500, 1000, 10000

asr = ASRModel.from_pretrained("openai/whisper-small", config={"languages": languages})
head_path = hf_hub_download("williamhtan/cld-whisper-small-enzh", f"{config}/model.pkl")
head = CVXNNLangDetectHead.load(head_path, asr)

asr.set_lang_detect_head(head)
audio_16k_mono: np.ndarray = ...   # shape (T,), 16 kHz mono
pred_langs, pred_texts = asr.predict(audio_16k_mono)
print(pred_langs[0], pred_texts[0])
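The snippet above expects a one-dimensional 16 kHz mono array. If your audio arrives in another layout, a conversion along the following lines may help. This is a rough sketch: `to_16k_mono` is a hypothetical helper, not part of jaxcld, it assumes multi-channel input is shaped (T, channels), and it resamples by plain linear interpolation with no anti-aliasing filter, so a proper resampler is preferable for real use.

```python
import numpy as np

TARGET_SR = 16_000  # sample rate expected by the usage snippet

def to_16k_mono(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix (T, channels) audio to mono and linearly resample to 16 kHz."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        audio = audio.mean(axis=1)  # average the channels
    if sr == TARGET_SR:
        return audio
    n_out = int(round(audio.shape[0] * TARGET_SR / sr))
    t_in = np.arange(audio.shape[0]) / sr   # input sample times in seconds
    t_out = np.arange(n_out) / TARGET_SR    # output sample times in seconds
    return np.interp(t_out, t_in, audio).astype(np.float32)

stereo_44k = np.zeros((44_100, 2), dtype=np.float32)  # one second of silent stereo
audio_16k_mono = to_16k_mono(stereo_44k, sr=44_100)
print(audio_16k_mono.shape)  # (16000,)
```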

Contents

One convex head per training-set size, under a per-config subfolder:

williamhtan/cld-whisper-small-enzh
├── 100/model.pkl
├── 500/model.pkl
├── 1000/model.pkl
└── 10000/model.pkl
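Because every head lives at the same relative path under its config folder, the filename passed to hf_hub_download can be derived from the config name. A small sketch follows; `head_filename` and the `CONFIGS` tuple are hypothetical conveniences, not part of jaxcld.

```python
REPO_ID = "williamhtan/cld-whisper-small-enzh"
CONFIGS = ("100", "500", "1000", "10000")  # samples per class, as folder names

def head_filename(config) -> str:
    """Return the repo-relative path of the head for one training-set size."""
    config = str(config)
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    return f"{config}/model.pkl"

for config in CONFIGS:
    print(REPO_ID, head_filename(config))
```

Validating the name up front turns a typo such as "5000" into an immediate error rather than a failed download.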

Results

Detection accuracy / WER / CER from benchmark_cld.py on the held-out test split for these exact artifacts. Validation peak = best test-split accuracy during CRONOS training. Training time and TFLOPs from the training run.

Samples/class  Det. acc       WER (↓)  CER (↓)  Val peak  data_seed  Train time (s)  TFLOPs
100            1.00 (n=20)    42.11    38.17    1.000     3          36.1            8,340
500            0.99 (n=100)   28.03    36.30    1.000     1          50.3            11,623
1000           0.99 (n=200)   31.68    31.22    0.985     6          64.4            14,871
10000          0.99 (n=1860)  27.67    28.47    0.989     6          310.1           71,642

Figure: Low-resource WER and detection accuracy vs. training-set size.

The convex head stays at ~0.99–1.00 detection accuracy across all sample sizes, including the 100-sample regime. The 100-sample test split is tiny (n=20), so its WER/CER are higher-variance.
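To get a feel for how much the test-set size limits precision, the sketch below computes two quantities for the detection-accuracy column: the accuracy change caused by one misclassified clip, and a binomial standard error. The test sizes come from the table above; the true accuracy of 0.99 is a hypothetical value chosen for illustration, and the binomial model is a simplifying assumption. WER and CER are not modeled here, although they can be expected to vary more on smaller splits as well.

```python
import math

# Test-set sizes n from the Results table, keyed by samples per class.
TEST_SIZES = {100: 20, 500: 100, 1000: 200, 10000: 1860}

P_ASSUMED = 0.99  # hypothetical true accuracy, for illustration only

def accuracy_step(n: int) -> float:
    """Accuracy change caused by a single misclassified clip."""
    return 1.0 / n

def binomial_std_error(p: float, n: int) -> float:
    """Standard error of an accuracy estimate under a binomial model."""
    return math.sqrt(p * (1.0 - p) / n)

for samples, n in TEST_SIZES.items():
    print(
        f"{samples:>5}/class  n={n:<5} "
        f"step={accuracy_step(n):.4f}  "
        f"std_err={binomial_std_error(P_ASSUMED, n):.4f}"
    )
```

At n=20 a single error moves accuracy by five points, so the 1.00 reported for the 100-sample head is compatible with a true accuracy somewhat below that.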

Training

Trained with train_cvxnn.py (CRONOS / ADMM in JAX). Shared hyperparameters: rank=20, neuron=64, beta=0.001, rho=0.1, gamma_ratio=1, admm_iters=6, pcg_iters=32, opt_seed=1024 (per-config data_seed in the table above). Inputs are mean-pooled frozen openai/whisper-small encoder embeddings (dim 768).
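For bookkeeping, the shared hyperparameters and the per-config data seeds can be collected in one place. The values below are copied from this card; the `CronosConfig` dataclass and `config_for` helper are hypothetical, and how train_cvxnn.py actually accepts these settings is not specified here.

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class CronosConfig:
    """Shared hyperparameters listed in the card (container is hypothetical)."""
    rank: int = 20
    neuron: int = 64
    beta: float = 0.001
    rho: float = 0.1
    gamma_ratio: int = 1
    admm_iters: int = 6
    pcg_iters: int = 32
    opt_seed: int = 1024
    data_seed: int = 0  # placeholder; overridden per config below

# data_seed per training-set size, taken from the Results table.
DATA_SEEDS = {100: 3, 500: 1, 1000: 6, 10000: 6}

def config_for(samples_per_class: int) -> CronosConfig:
    """Build the settings for one training-set size."""
    return CronosConfig(data_seed=DATA_SEEDS[samples_per_class])

print(asdict(config_for(1000)))
```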

Citation

@inproceedings{feng2026cld,
  title     = {Convex Low-resource Accent-Robust Language Detection in Speech Recognition},
  author    = {Feng, Miria and Tan, William and Pilanci, Mert},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR},
  url       = {https://arxiv.org/abs/2605.23235}
}

License

MIT. See the CLD repository.
