Pratilekha by Alchemyst AI - Lightweight Indic STT Model

Pratilekha is a lightweight Indic speech-to-text (STT) model built on OpenAI's Whisper Tiny with LoRA adapters. It transcribes multilingual Indic speech with automatic language detection. Trained on noisy, real-world audio data, Pratilekha handles diverse acoustic conditions and language or accent switching, and achieves roughly 20% better accuracy than OpenAI's Whisper-1 on Indic language benchmarks.

Model Details

Model Description

This model is a LoRA-adapted version of OpenAI's Whisper Tiny, fine-tuned on noisy audio data for robust transcription of speech in multiple Indic languages. It handles switching between languages and accents both within and across utterances, which makes it well suited to code-mixed and multi-accent Indic speech. Despite its lightweight architecture, Pratilekha delivers approximately 20% better transcription accuracy than OpenAI's Whisper-1 on Indic language tasks. It supports automatic language detection and can be served through a WebSocket-based FastAPI server for real-time transcription.

  • Developed by: Anuran Roy
  • Model type: Seq2Seq Speech-to-Text (LoRA fine-tuned)
  • Language(s) (NLP): Hindi (hi), English (en), Bengali (bn), Tamil (ta), Telugu (te), Marathi (mr), Gujarati (gu), Kannada (kn), Malayalam (ml), Punjabi (pa), Urdu (ur), Odia (or)
  • License: CC-BY-SA-4.0
  • Finetuned from model: openai/whisper-tiny

Uses

Direct Use

The model can be used directly for speech-to-text transcription of Indic language audio files. It supports both explicit language specification and automatic language detection. Input audio is expected as 16-bit PCM (pcm16) sampled at 16 kHz.
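The input requirement above can be checked before audio is sent for transcription. The following is a minimal sketch using only the Python standard library; the helper names check_wav_format and make_silent_wav are hypothetical and not part of this repository. Note that librosa.load(..., sr=16000), as used in the inference example, already resamples on load, so a check like this is mostly relevant when raw WAV bytes are passed around directly.

```python
import io
import wave


def check_wav_format(wav_bytes: bytes, expected_rate: int = 16000) -> dict:
    """Report whether WAV bytes are 16-bit PCM at the expected sample rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        info = {
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }
    # pcm16 means 2 bytes per sample.
    info["ok"] = (
        info["sample_rate"] == expected_rate
        and info["sample_width_bytes"] == 2
    )
    return info


def make_silent_wav(seconds: float = 0.5, rate: int = 16000) -> bytes:
    """Build a mono 16-bit PCM WAV of silence, used here only as demo input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = pcm16
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


if __name__ == "__main__":
    print(check_wav_format(make_silent_wav()))
```

Audio that fails the check can be resampled or converted before use, for example by loading it with librosa at sr=16000.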

Downstream Use

The model can be integrated into voice agent pipelines, real-time transcription services, or any application requiring Indic language speech recognition. A WebSocket-based FastAPI server is provided for real-time inference.

Out-of-Scope Use

  • Languages outside the supported list above.
  • Extremely noisy or very low-quality audio recordings, which may produce poor results.
  • Not intended for speaker identification or diarization.

How to Get Started with the Model

Python Inference

import librosa
import torch
from peft import PeftModel
from transformers import WhisperProcessor, WhisperForConditionalGeneration

def load_finetuned_model(
  adapter_path: str = "whisper-indic-voice-agent-tiny",
  base_model: str = "openai/whisper-tiny"
):
  """Load the fine-tuned Whisper model with LoRA adapters"""
  processor = WhisperProcessor.from_pretrained(adapter_path)
  model = WhisperForConditionalGeneration.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto"
  )
  model = PeftModel.from_pretrained(model, adapter_path)
  model.eval()
  return model, processor

# Load model
model, processor = load_finetuned_model()

# Load audio
audio, sr = librosa.load("path/to/audio.wav", sr=16000)

# Process and transcribe
input_features = processor(
  audio, sampling_rate=16000, return_tensors="pt"
).input_features.to(model.device, dtype=torch.float16)

with torch.no_grad():
  predicted_ids = model.generate(
    input_features,
    max_length=448,
    language="hi",  # Set to None for auto-detection
    task="transcribe"
  )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")

WebSocket Server

A FastAPI-based WebSocket server is included for real-time transcription:

python server.py

The server exposes:

  • GET /health — Health check endpoint.
  • WS /ws/transcribe — WebSocket endpoint for streaming transcription.

WebSocket Message Format

Request:

{
  "audio": {
    "data": "<base64-encoded-audio>",
    "encoding": "audio/wav",
    "sample_rate": "16000"
  }
}
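A request in this shape can be assembled with the standard library alone. The sketch below is illustrative: build_transcribe_request is a hypothetical helper name, and it sends sample_rate as a string only because the documented example shows it quoted. Whether the server also accepts a number is not stated here.

```python
import base64
import json


def build_transcribe_request(wav_bytes: bytes, sample_rate: int = 16000) -> str:
    """Serialize WAV bytes into the JSON request shape documented above."""
    payload = {
        "audio": {
            # Binary audio is carried as base64 text inside the JSON message.
            "data": base64.b64encode(wav_bytes).decode("ascii"),
            "encoding": "audio/wav",
            # Assumption: kept as a string to mirror the documented example.
            "sample_rate": str(sample_rate),
        }
    }
    return json.dumps(payload)


if __name__ == "__main__":
    # Placeholder bytes; real use would pass the contents of a 16 kHz WAV file.
    print(build_transcribe_request(b"RIFF....WAVE")[:60])
```

The resulting string is what a WebSocket client would send as a text frame to /ws/transcribe.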

Response (events + data):

{"type": "events", "data": {"signal_type": "START_SPEECH", "occured_at": 1234567890.0, "session_id": "..."}}
{"type": "events", "data": {"signal_type": "END_SPEECH", "occured_at": 1234567890.0, "session_id": "..."}}
{
  "type": "data",
  "data": {
    "request_id": "...",
    "transcript": "transcribed text",
    "language_code": "hi-IN",
    "metrics": {
      "audio_duration": 3.5,
      "processing_latency": 0.42
    }
  }
}
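On the client side, the two message kinds can be told apart by their "type" field. The following is a minimal sketch of such a dispatcher, assuming every server message is one JSON object shaped like the examples above; handle_server_message is a hypothetical name, and the real server may emit additional message types or fields.

```python
import json
from typing import Optional


def handle_server_message(raw: str) -> Optional[str]:
    """Return the transcript for "data" messages, or None for event messages."""
    message = json.loads(raw)
    kind = message.get("type")
    body = message.get("data", {})

    if kind == "events":
        # Speech-boundary signals such as START_SPEECH and END_SPEECH.
        print(f"event: {body.get('signal_type')} (session {body.get('session_id')})")
        return None

    if kind == "data":
        metrics = body.get("metrics", {})
        print(
            f"[{body.get('language_code')}] "
            f"latency={metrics.get('processing_latency')}s"
        )
        return body.get("transcript")

    raise ValueError(f"unexpected message type: {kind!r}")
```

Returning None for events lets a caller loop over incoming frames and collect only the transcripts.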

Training Details

Training Procedure

Training is executed via the provided shell script:

bash run_training.sh

This script:

  1. Creates and activates a Python virtual environment.
  2. Installs dependencies from requirements.txt.
  3. Runs train_file.py with GPU support (CUDA_VISIBLE_DEVICES=0).
  4. Saves the fine-tuned model to ./whisper-indic-voice-agent-tiny.

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Adapter method: LoRA (via PEFT)

Bias, Risks, and Limitations

  • Performance may vary across supported languages depending on training data distribution.
  • Auto language detection may be inaccurate for short audio clips or code-mixed speech.
  • The Whisper Tiny base model has limited capacity; expect lower accuracy compared to larger Whisper variants.

Recommendations

Users should validate transcription quality on their target language and domain before deploying in production. Consider using a larger base model (e.g., whisper-small or whisper-medium) for improved accuracy.
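One common way to carry out this validation is to compute word error rate (WER) against a small set of hand-checked reference transcripts. This model card does not say which metric underlies its accuracy figures, so WER is an assumption here, and the function below is an illustrative sketch rather than the evaluation code used for Pratilekha.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a hypothesis word
                previous[j - 1] + cost,  # substitution or exact match
            ))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("मेरा नाम राम है", "मेरा नाम श्याम है"))
```

Whitespace tokenization is a simplification; for some Indic scripts a character-level error rate, or text normalization before scoring, may give a fairer picture.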

Technical Specifications

Model Architecture and Objective

  • Base Architecture: OpenAI Whisper Tiny (encoder-decoder transformer)
  • Fine-tuning Method: LoRA (Low-Rank Adaptation) via PEFT
  • Objective: Seq2Seq speech transcription
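To illustrate why LoRA keeps fine-tuning lightweight, the sketch below counts the trainable parameters LoRA adds to a single square projection matrix. Whisper Tiny uses a hidden size of 384; the rank of 8 is purely hypothetical, since the rank and target modules used for Pratilekha are not stated in this card.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # LoRA learns B (d_out x rank) and A (rank x d_in) and leaves the
    # original weight frozen, so only rank * (d_in + d_out) values train.
    return rank * (d_in + d_out)


D_MODEL = 384  # Whisper Tiny hidden size
RANK = 8       # hypothetical value; not taken from this model card

full = D_MODEL * D_MODEL
added = lora_param_count(D_MODEL, D_MODEL, RANK)
print(f"full matrix: {full} params, LoRA adds: {added} ({added / full:.1%})")
```

The ratio scales linearly with the rank, so a higher rank trades a larger adapter for more capacity.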

Compute Infrastructure

Hardware

  • GPU: CUDA-compatible GPU (single-GPU training via CUDA_VISIBLE_DEVICES=0). Training used one L4 GPU for a total of 3 GPU hours.

Software

  • Python 3.x
  • PyTorch with CUDA
  • Hugging Face Transformers
  • PEFT 0.17.1
  • librosa, soundfile
  • FastAPI + Uvicorn (for serving)

Model Card Authors

Anuran Roy

Framework versions

  • PEFT 0.17.1