
🇮🇳 Optimal Curriculum Learning for Multilingual Indian ASR

Three-Phase Curriculum + Cooldown Temperature Sampling for 22 Scheduled Indian Languages

Overview

This repository contains a literature-grounded data design strategy for fine-tuning ASR-LLM models (Gemma 4 E4B, OmniASR, Qwen3-ASR) on all 22 scheduled Indian languages, where per-language training data is highly imbalanced (from 90h to 8,500h).

The core challenge: improve low-resource language ASR without degrading high-resource performance.

Key Insight

No single static sampling temperature works well across all tiers. Based on 8 key papers, we combine the following (a minimal sketch follows the list):

  • Cooldown dynamic temperature scheduling (τ: 5 → 2)
  • Three-phase curriculum: Foundation → Expansion → Refinement
  • Replay buffers for anti-forgetting
  • Language-specific adapters (MMS LSAH pattern)
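
A minimal sketch of the first two ingredients, assuming illustrative hour counts and helper names (the actual implementation lives in curriculum_sampler.py):

import numpy as np

def temperature_probs(hours: dict[str, float], tau: float) -> dict[str, float]:
    # p_l ∝ (h_l / total_hours) ** (1 / tau): tau=1 is proportional, larger tau is flatter
    langs = list(hours)
    h = np.array([hours[l] for l in langs], dtype=float)
    p = (h / h.sum()) ** (1.0 / tau)
    return dict(zip(langs, p / p.sum()))

def cooldown_tau(step: int, total_steps: int, tau_start: float = 5.0, tau_end: float = 2.0) -> float:
    # Linearly anneal tau from tau_start (near-uniform) to tau_end (closer to proportional)
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start + frac * (tau_end - tau_start)

# Illustrative hour counts, one language per tier
hours = {"Hindi": 8500.0, "Urdu": 500.0, "Santali": 90.0}
print(temperature_probs(hours, cooldown_tau(0, 100_000)))        # early: near-uniform mix
print(temperature_probs(hours, cooldown_tau(100_000, 100_000)))  # late: leans back toward proportional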

Results from Ablation Simulation

Strategy               Tier 1 (HR)   Tier 2 (MR)   Tier 3 (LR)   Max/Min Ratio
Proportional (τ=1)     95.0%         3.3%          1.7%          101x
Static τ=5             57.1%         15.7%         27.2%         2.4x
Cooldown τ=5→2         66.3%         15.0%         18.7%         4.4x
3-Phase Curriculum     74.2%         8.9%          11.0%         9.8x

The 3-phase curriculum provides the best balance: it maintains strong Tier 1 coverage while significantly boosting Tier 3 exposure from 1.7% to 11.0%.
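
The tier columns and Max/Min ratio above can be recomputed from any set of per-language sampling probabilities. A tiny illustrative helper (names assumed; not the ablation_sweep.py API), shown with three stand-in languages:

def summarize(probs: dict[str, float], tier_of: dict[str, str]) -> dict[str, float]:
    # Sum per-language probabilities into tier totals and report the max/min probability ratio
    summary = {"Tier 1": 0.0, "Tier 2": 0.0, "Tier 3": 0.0}
    for lang, p in probs.items():
        summary[tier_of[lang]] += p
    summary["Max/Min"] = max(probs.values()) / min(probs.values())
    return summary

tier_of = {"Hindi": "Tier 1", "Urdu": "Tier 2", "Santali": "Tier 3"}
print(summarize({"Hindi": 0.80, "Urdu": 0.13, "Santali": 0.07}, tier_of))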

Files

File                             Description
CURRICULUM_LEARNING_STRATEGY.md  Comprehensive strategy document with all hyperparameters
curriculum_sampler.py            Core module: temperature sampling, Cooldown schedule, CurriculumOrchestrator
train_gemma4_indian_asr.py       Training script for Gemma 4 E4B with curriculum learning
train_qwen3_asr_indian.py        Training script for Qwen3-ASR with curriculum learning
ablation_sweep.py                Ablation comparison of 8 sampling strategies

Language Tiers

Tier 1: HIGH Resource (1,400-8,500h)

Hindi, Tamil, Marathi, Kannada, Odia, Bengali, Telugu, Gujarati, Malayalam, Punjabi

Tier 2: MID Resource (320-550h)

Urdu, Assamese, Nepali, Sanskrit

Tier 3: LOW Resource (90-210h)

Maithili, Dogri, Konkani, Sindhi, Bodo, Manipuri, Kashmiri, Santali

Three-Phase Training Strategy

Phase 1: Foundation Alignment (10% of steps)
  → Train on Tier 1 only, τ=1.0
  → Build encoder-decoder alignment on clean, abundant data

Phase 2: Aggressive Multilingual Expansion (50% of steps)
  → All 22 languages, Cooldown τ: 5.0 → 2.0
  → Low-resource languages get ~3% each vs ~0.3% under proportional sampling

Phase 3: Refinement & Consolidation (40% of steps)
  → All 22 languages, τ=1.5 + 10% replay buffer
  → Language-specific adapters prevent interference
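
Taken together, the schedule maps a global training step to a language pool, temperature, and replay ratio roughly as in the sketch below (phase boundaries and temperatures as above; the helper name is an assumption, not the curriculum_sampler.py API):

def phase_settings(step: int, total_steps: int) -> dict:
    frac = step / total_steps
    if frac < 0.10:                        # Phase 1: Foundation Alignment
        return {"languages": "tier1", "tau": 1.0, "replay_ratio": 0.0}
    if frac < 0.60:                        # Phase 2: Aggressive Multilingual Expansion
        p2_frac = (frac - 0.10) / 0.50     # progress within Phase 2
        return {"languages": "all22", "tau": 5.0 + p2_frac * (2.0 - 5.0), "replay_ratio": 0.0}
    return {"languages": "all22", "tau": 1.5, "replay_ratio": 0.10}   # Phase 3: Refinement

print(phase_settings(5_000, 100_000))    # Phase 1
print(phase_settings(35_000, 100_000))   # Phase 2, tau mid-anneal
print(phase_settings(80_000, 100_000))   # Phase 3 with 10% replay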

Datasets Used

  • IndicVoices (ai4bharat/IndicVoices) – all 22 languages
  • Shrutilipi (ai4bharat/Shrutilipi) – 16 languages, broadcast domain
  • Kathbath (ai4bharat/Kathbath) – 12 languages, read speech
  • FLEURS (google/fleurs) – 14 Indian languages, evaluation
  • Common Voice – variable coverage
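
As a hedged starting point for evaluation data, FLEURS can be loaded per language with the Hugging Face datasets library; the "hi_in" config below is its Hindi subset (recent datasets versions may also require trust_remote_code=True, and the other corpora use their own config/split names, so check each dataset card):

from datasets import load_dataset

# FLEURS Hindi test split; swap the config name for other languages
fleurs_hi = load_dataset("google/fleurs", "hi_in", split="test")
print(fleurs_hi[0]["transcription"])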

Key Papers

  1. Cooldown (2410.04579) – Dynamic temperature scheduling
  2. UniMax (2304.09151) – Cap-based uniform sampling
  3. OmniASR (2511.09690) – β-based upsampling for 1600+ languages
  4. MMS (2305.13516) – Language-specific adapters (LSAH)
  5. Zipper-LoRA (2603.17558) – Rank-level LoRA decoupling for multilingual ASR
  6. WavLLM (2404.00656) – Two-stage curriculum learning
  7. Polyglot-Lion (2603.16184) – Balanced multilingual ASR fine-tuning
  8. CL for LLMs (2506.11300) – Curriculum learning as warmup

Quick Start

from curriculum_sampler import CurriculumConfig, CurriculumOrchestrator

config = CurriculumConfig(
    phase2_tau_start=5.0,
    phase2_tau_end=2.0,
    replay_ratio=0.10,
)
orchestrator = CurriculumOrchestrator(config, total_steps=100000)

# Get sampling probabilities at any training step
probs = orchestrator.get_sampling_probabilities(step=50000)
# Returns: {"Hindi": 0.075, "Tamil": 0.060, ..., "Santali": 0.030}
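
One way (not prescribed by this repo) to turn those probabilities into an actual training stream is datasets.interleave_datasets; the per_language_datasets mapping below, language name to a Dataset of that language's audio/text pairs, is assumed to already exist:

from datasets import interleave_datasets

def build_mixture(per_language_datasets, probs, seed=42):
    # Mix per-language datasets according to the orchestrator's probabilities
    langs = list(per_language_datasets)
    return interleave_datasets(
        [per_language_datasets[l] for l in langs],
        probabilities=[probs[l] for l in langs],
        seed=seed,
        stopping_strategy="all_exhausted",  # keep cycling low-resource languages
    )

# mixture = build_mixture(per_language_datasets, orchestrator.get_sampling_probabilities(step=0))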

License

Apache 2.0
