# 🇮🇳 Optimal Curriculum Learning for Multilingual Indian ASR

**Three-Phase Curriculum + Cooldown Temperature Sampling for 22 Scheduled Indian Languages**
## Overview

This repository contains a literature-grounded data design strategy for fine-tuning ASR-LLM models (Gemma 4 E4B, OmniASR, Qwen3-ASR) on all 22 scheduled Indian languages with highly imbalanced data distributions (90h–8,500h per language).
The core challenge: improve low-resource language ASR without degrading high-resource performance.
## Key Insight

No single static temperature works. Based on 8 key papers, we combine:

- **Cooldown** dynamic temperature scheduling (τ: 5→2; sketched below)
- **Three-phase curriculum**: Foundation → Expansion → Refinement
- **Replay buffers** for anti-forgetting
- **Language-specific adapters** (MMS LSAH pattern)
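For reference, temperature sampling picks language *i* with probability proportional to its data share raised to the power 1/τ, so τ=1 recovers proportional sampling and larger τ flattens the distribution toward uniform. Below is a minimal sketch of this formula together with a Cooldown-style annealed temperature; the linear decay shape is an assumption here, and the actual schedule lives in `curriculum_sampler.py`:

```python
import numpy as np

def temperature_probs(hours: dict[str, float], tau: float) -> dict[str, float]:
    """p_i ∝ n_i^(1/tau): tau=1 is proportional, tau→∞ approaches uniform."""
    langs = list(hours)
    weights = np.array([hours[lang] for lang in langs]) ** (1.0 / tau)
    weights /= weights.sum()
    return dict(zip(langs, weights))

def cooldown_tau(step: int, total_steps: int,
                 tau_start: float = 5.0, tau_end: float = 2.0) -> float:
    """Anneal the sampling temperature from tau_start down to tau_end."""
    frac = min(step / total_steps, 1.0)
    return tau_start + frac * (tau_end - tau_start)
```

As a sanity check, at τ=5 an 8,500h language outweighs a 90h one by only (8500/90)^0.2 ≈ 2.5x, versus ~94x under proportional (τ=1) sampling.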
## Results from Ablation Simulation
| Strategy | Tier 1 (HR) | Tier 2 (MR) | Tier 3 (LR) | Max/Min Ratio |
|---|---|---|---|---|
| Proportional (τ=1) | 95.0% | 3.3% | 1.7% | 101x |
| Static τ=5 | 57.1% | 15.7% | 27.2% | 2.4x |
| Cooldown τ=5→2 | 66.3% | 15.0% | 18.7% | 4.4x |
| 3-Phase Curriculum | 74.2% | 8.9% | 11.0% | 9.8x |
The 3-phase curriculum provides the best balance: it maintains strong Tier 1 coverage while significantly boosting Tier 3 exposure from 1.7% → 11.0% of sampled data.
## Files

| File | Description |
|---|---|
| `CURRICULUM_LEARNING_STRATEGY.md` | Comprehensive strategy document with all hyperparameters |
| `curriculum_sampler.py` | Core module: temperature sampling, Cooldown schedule, `CurriculumOrchestrator` |
| `train_gemma4_indian_asr.py` | Training script for Gemma 4 E4B with curriculum learning |
| `train_qwen3_asr_indian.py` | Training script for Qwen3-ASR with curriculum learning |
| `ablation_sweep.py` | Ablation comparison of 8 sampling strategies |
## Language Tiers

### Tier 1: HIGH Resource (1,400-8,500h)
Hindi, Tamil, Marathi, Kannada, Odia, Bengali, Telugu, Gujarati, Malayalam, Punjabi

### Tier 2: MID Resource (320-550h)
Urdu, Assamese, Nepali, Sanskrit

### Tier 3: LOW Resource (90-210h)
Maithili, Dogri, Konkani, Sindhi, Bodo, Manipuri, Kashmiri, Santali
## Three-Phase Training Strategy

### Phase 1: Foundation Alignment (10% of steps)
- Train on Tier 1 only, τ=1.0
- Build encoder-decoder alignment on clean, abundant data

### Phase 2: Aggressive Multilingual Expansion (50% of steps)
- All 22 languages, Cooldown τ: 5.0 → 2.0
- Low-resource languages get ~3% each vs. ~0.3% under proportional sampling

### Phase 3: Refinement & Consolidation (40% of steps)
- All 22 languages, τ=1.5 + 10% replay buffer
- Language-specific adapters prevent interference
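A minimal sketch of how these phase boundaries can map a global step to sampling settings; the class and field names below are illustrative, not the actual `CurriculumOrchestrator` API:

```python
from dataclasses import dataclass

@dataclass
class PhaseSettings:
    languages: str       # "tier1" (Phase 1) or "all" (Phases 2-3)
    tau: float           # sampling temperature at this step
    replay_ratio: float  # fraction of each batch drawn from the replay buffer

def phase_settings(step: int, total_steps: int) -> PhaseSettings:
    frac = step / total_steps
    if frac < 0.10:
        # Phase 1: Foundation Alignment on Tier 1 only, proportional sampling
        return PhaseSettings("tier1", tau=1.0, replay_ratio=0.0)
    if frac < 0.60:
        # Phase 2: Expansion; anneal tau from 5.0 to 2.0 across the 50% window
        progress = (frac - 0.10) / 0.50
        return PhaseSettings("all", tau=5.0 - 3.0 * progress, replay_ratio=0.0)
    # Phase 3: Refinement with a fixed tau and a 10% replay buffer
    return PhaseSettings("all", tau=1.5, replay_ratio=0.10)
```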
## Datasets Used

- IndicVoices (`ai4bharat/IndicVoices`) – all 22 languages
- Shrutilipi (`ai4bharat/Shrutilipi`) – 16 languages, broadcast domain
- Kathbath (`ai4bharat/Kathbath`) – 12 languages, read speech
- FLEURS (`google/fleurs`) – 14 Indian languages, evaluation
- Common Voice – variable coverage
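All of these are hosted on the Hugging Face Hub, so they load with the `datasets` library as in the sketch below; the config and split names here are assumptions, so check each dataset card before use:

```python
from datasets import load_dataset

# Hindi FLEURS test split for evaluation (config/split names per the dataset card)
fleurs_hi = load_dataset("google/fleurs", "hi_in", split="test")
print(fleurs_hi[0]["transcription"])
```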
## Key Papers

- Cooldown (2410.04579) – Dynamic temperature scheduling
- UniMax (2304.09151) – Cap-based uniform sampling
- OmniASR (2511.09690) – β-based upsampling for 1600+ languages
- MMS (2305.13516) – Language-specific adapters (LSAH)
- Zipper-LoRA (2603.17558) – Rank-level LoRA decoupling for multilingual ASR
- WavLLM (2404.00656) – Two-stage curriculum learning
- Polyglot-Lion (2603.16184) – Balanced multilingual ASR fine-tuning
- CL for LLMs (2506.11300) – Curriculum learning as warmup
## Quick Start

```python
from curriculum_sampler import CurriculumConfig, CurriculumOrchestrator

config = CurriculumConfig(
    phase2_tau_start=5.0,
    phase2_tau_end=2.0,
    replay_ratio=0.10,
)
orchestrator = CurriculumOrchestrator(config, total_steps=100000)

# Get sampling probabilities at any training step
probs = orchestrator.get_sampling_probabilities(step=50000)
# Returns: {"Hindi": 0.075, "Tamil": 0.060, ..., "Santali": 0.030}
```
## License
Apache 2.0