# 🇮🇳 Optimal Curriculum Learning for Multilingual Indian ASR

**Three-Phase Curriculum + Cooldown Temperature Sampling for 22 Scheduled Indian Languages**
## Overview

This repository contains a literature-grounded data design strategy for fine-tuning ASR-LLM models (Gemma 4 E4B, OmniASR, Qwen3-ASR) on all 22 scheduled Indian languages with highly imbalanced data distributions (90h–8,500h per language).
The core challenge: improve low-resource language ASR without degrading high-resource performance.
## Key Insight

No single static temperature works. Based on 8 key papers, we combine:

- **Cooldown** dynamic temperature scheduling (τ: 5→2; sketched below)
- **Three-phase curriculum**: Foundation → Expansion → Refinement
- **Replay buffers** for anti-forgetting
- **Language-specific adapters** (MMS LSAH pattern)
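For reference, temperature sampling picks language *i* with probability proportional to its data share raised to the power 1/τ, so τ=1 recovers proportional sampling and larger τ flattens the distribution toward uniform. Below is a minimal sketch of this formula together with a Cooldown-style annealed temperature; the linear decay shape is an assumption here, and the actual schedule lives in `curriculum_sampler.py`:

```python
import numpy as np

def temperature_probs(hours: dict[str, float], tau: float) -> dict[str, float]:
    """p_i ∝ n_i^(1/tau): tau=1 is proportional, tau→∞ approaches uniform."""
    langs = list(hours)
    weights = np.array([hours[lang] for lang in langs]) ** (1.0 / tau)
    weights /= weights.sum()
    return dict(zip(langs, weights))

def cooldown_tau(step: int, total_steps: int,
                 tau_start: float = 5.0, tau_end: float = 2.0) -> float:
    """Anneal the sampling temperature from tau_start down to tau_end."""
    frac = min(step / total_steps, 1.0)
    return tau_start + frac * (tau_end - tau_start)
```

As a sanity check, at τ=5 an 8,500h language outweighs a 90h one by only (8500/90)^0.2 ≈ 2.5x, versus ~94x under proportional (τ=1) sampling.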
## Results from Ablation Simulation
| Strategy | Tier 1 (HR) | Tier 2 (MR) | Tier 3 (LR) | Max/Min Ratio |
|---|---|---|---|---|
| Proportional (τ=1) | 95.0% | 3.3% | 1.7% | 101x |
| Static τ=5 | 57.1% | 15.7% | 27.2% | 2.4x |
| Cooldown τ=5→2 | 66.3% | 15.0% | 18.7% | 4.4x |
| 3-Phase Curriculum | 74.2% | 8.9% | 11.0% | 9.8x |
The 3-phase curriculum provides the best balance: it maintains strong Tier 1 coverage while significantly boosting Tier 3 exposure from 1.7% → 11.0% of sampled data.
## Files

| File | Description |
|---|---|
| `CURRICULUM_LEARNING_STRATEGY.md` | Comprehensive strategy document with all hyperparameters |
| `curriculum_sampler.py` | Core module: temperature sampling, Cooldown schedule, `CurriculumOrchestrator` |
| `train_gemma4_indian_asr.py` | Training script for Gemma 4 E4B with curriculum learning |
| `train_qwen3_asr_indian.py` | Training script for Qwen3-ASR with curriculum learning |
| `ablation_sweep.py` | Ablation comparison of 8 sampling strategies |
## Language Tiers

### Tier 1: HIGH Resource (1,400-8,500h)
Hindi, Tamil, Marathi, Kannada, Odia, Bengali, Telugu, Gujarati, Malayalam, Punjabi

### Tier 2: MID Resource (320-550h)
Urdu, Assamese, Nepali, Sanskrit

### Tier 3: LOW Resource (90-210h)
Maithili, Dogri, Konkani, Sindhi, Bodo, Manipuri, Kashmiri, Santali
## Three-Phase Training Strategy

### Phase 1: Foundation Alignment (10% of steps)
- Train on Tier 1 only, τ=1.0
- Build encoder-decoder alignment on clean, abundant data

### Phase 2: Aggressive Multilingual Expansion (50% of steps)
- All 22 languages, Cooldown τ: 5.0 → 2.0
- Low-resource languages get ~3% each vs. ~0.3% under proportional sampling

### Phase 3: Refinement & Consolidation (40% of steps)
- All 22 languages, τ=1.5 + 10% replay buffer
- Language-specific adapters prevent interference
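A minimal sketch of how these phase boundaries can map a global step to sampling settings; the class and field names below are illustrative, not the actual `CurriculumOrchestrator` API:

```python
from dataclasses import dataclass

@dataclass
class PhaseSettings:
    languages: str       # "tier1" (Phase 1) or "all" (Phases 2-3)
    tau: float           # sampling temperature at this step
    replay_ratio: float  # fraction of each batch drawn from the replay buffer

def phase_settings(step: int, total_steps: int) -> PhaseSettings:
    frac = step / total_steps
    if frac < 0.10:
        # Phase 1: Foundation Alignment on Tier 1 only, proportional sampling
        return PhaseSettings("tier1", tau=1.0, replay_ratio=0.0)
    if frac < 0.60:
        # Phase 2: Expansion; anneal tau from 5.0 to 2.0 across the 50% window
        progress = (frac - 0.10) / 0.50
        return PhaseSettings("all", tau=5.0 - 3.0 * progress, replay_ratio=0.0)
    # Phase 3: Refinement with a fixed tau and a 10% replay buffer
    return PhaseSettings("all", tau=1.5, replay_ratio=0.10)
```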
## Datasets Used

- IndicVoices (`ai4bharat/IndicVoices`) – all 22 languages
- Shrutilipi (`ai4bharat/Shrutilipi`) – 16 languages, broadcast domain
- Kathbath (`ai4bharat/Kathbath`) – 12 languages, read speech
- FLEURS (`google/fleurs`) – 14 Indian languages, evaluation
- Common Voice – variable coverage
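All of these are hosted on the Hugging Face Hub, so they load with the `datasets` library as in the sketch below; the config and split names here are assumptions, so check each dataset card before use:

```python
from datasets import load_dataset

# Hindi FLEURS test split for evaluation (config/split names per the dataset card)
fleurs_hi = load_dataset("google/fleurs", "hi_in", split="test")
print(fleurs_hi[0]["transcription"])
```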
## Key Papers

- Cooldown (2410.04579) – Dynamic temperature scheduling
- UniMax (2304.09151) – Cap-based uniform sampling
- OmniASR (2511.09690) – β-based upsampling for 1600+ languages
- MMS (2305.13516) – Language-specific adapters (LSAH)
- Zipper-LoRA (2603.17558) – Rank-level LoRA decoupling for multilingual ASR
- WavLLM (2404.00656) – Two-stage curriculum learning
- Polyglot-Lion (2603.16184) – Balanced multilingual ASR fine-tuning
- CL for LLMs (2506.11300) – Curriculum learning as warmup
## Quick Start

```python
from curriculum_sampler import CurriculumConfig, CurriculumOrchestrator

config = CurriculumConfig(
    phase2_tau_start=5.0,
    phase2_tau_end=2.0,
    replay_ratio=0.10,
)
orchestrator = CurriculumOrchestrator(config, total_steps=100000)

# Get sampling probabilities at any training step
probs = orchestrator.get_sampling_probabilities(step=50000)
# Returns: {"Hindi": 0.075, "Tamil": 0.060, ..., "Santali": 0.030}
```
## License
Apache 2.0