FairHindiSER: Fair and Robust Speech Emotion Recognition
This repository contains the checkpoints for FairHindiSER, a speech emotion recognition model trained on a 50/50 mix of Hindi (IITKGP) and English (IEMOCAP) emotional speech. The model predicts four emotions: angry, happy, neutral, sad.
The backbone is facebook/wav2vec2-base, adapted with:
- A FairSER MLP head on top of pooled wav2vec2 features.
- LoRA adapters on the top transformer layers.
- CLUES-style contrastive debiasing to reduce gaps across language/gender groups.
- Gradual full unfreezing of the encoder with Optuna-tuned learning rates.
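The FairSER head in the first bullet can be sketched as a small MLP over time-pooled encoder features. The class name, hidden width, and dropout rate below are illustrative assumptions, not the repo's actual implementation; the only facts taken from this card are the 768-dim wav2vec2-base frames and the four emotion classes.

```python
import torch
import torch.nn as nn

class FairSERHead(nn.Module):
    """Hypothetical MLP head over mean-pooled wav2vec2 features (dims assumed)."""

    def __init__(self, hidden_dim=768, num_emotions=4, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(256, num_emotions),
        )

    def forward(self, frame_features):
        # frame_features: (batch, time, hidden_dim) from the wav2vec2 encoder
        pooled = frame_features.mean(dim=1)  # mean-pool over the time axis
        return self.net(pooled)              # (batch, num_emotions) logits

head = FairSERHead()
logits = head(torch.randn(2, 50, 768))  # 2 clips, 50 encoder frames each
print(logits.shape)  # torch.Size([2, 4])
```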
Checkpoints in this repo
- `checkpoints/head_best.pt`: head-only fine-tuning (backbone frozen).
- `checkpoints/lora_best.pt`: LoRA fine-tuning with focal loss and class weights.
- `checkpoints/clues_lora_best.pt`: LoRA + CLUES contrastive debiasing.
- `checkpoints/full_best.pt`: final model with gradual full unfreezing.
You can load the final model like this:
```python
import torch

from fairhindiser import FairSERModel  # your model class, if you publish it as a pip package

# Load the checkpoint's state dict onto CPU and restore the model
ckpt = torch.load("checkpoints/full_best.pt", map_location="cpu")
model = FairSERModel()
model.load_state_dict(ckpt)
model.eval()
```
(Adapt the import path to your own project structure.)
Intended use
- Research on cross-lingual and fair speech emotion recognition.
- Analysis of robustness and calibration under common audio corruptions (noise, speed, pitch perturbations).
- As a backbone for downstream SER systems in multilingual / accented settings.
Not intended for high-stakes decisions or medical/psychological diagnosis.
Training data
- Hindi: IITKGP Hindi corpus (naturalistic, acted dialogues).
- English: IEMOCAP 4-class subset (angry, happy/excited, neutral, sad).
All audio was resampled to 16 kHz mono and normalized. The combined dataset contains 3200 Hindi + 3200 English clips, stratified into train/val/test splits.
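The preprocessing described above (mono, 16 kHz, normalized) can be sketched in plain PyTorch. This is a rough stand-in, not the repo's pipeline: it uses naive linear interpolation for resampling and peak normalization, whereas a real pipeline would use torchaudio or librosa resampling with proper anti-aliasing.

```python
import torch
import torch.nn.functional as F

def preprocess(wav, sr, target_sr=16000):
    """Sketch of the stated pipeline: collapse to mono, resample to 16 kHz, peak-normalize.

    Resampling here is naive linear interpolation, used only for illustration.
    """
    if wav.dim() == 2 and wav.size(0) > 1:   # stereo -> mono
        wav = wav.mean(dim=0, keepdim=True)
    if wav.dim() == 1:
        wav = wav.unsqueeze(0)               # shape (1, samples)
    if sr != target_sr:
        new_len = int(wav.size(-1) * target_sr / sr)
        # interpolate expects (batch, channels, length)
        wav = F.interpolate(wav.unsqueeze(0), size=new_len,
                            mode="linear", align_corners=False).squeeze(0)
    peak = wav.abs().max().clamp(min=1e-8)
    return wav / peak                        # peak-normalize to [-1, 1]

clip = preprocess(torch.randn(2, 44100), sr=44100)  # 1 s of stereo noise at 44.1 kHz
print(clip.shape)  # torch.Size([1, 16000])
```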
Evaluation
We report metrics on the held-out test set along AudioTrust-style evaluation axes:
- Per-class F1 and the confusion matrix.
- Group F1 by language, gender, and accent.
- Robustness under additive noise, speed, and pitch perturbations.
- Calibration and privacy proxies derived from confidence distributions.
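A typical additive-noise probe mixes white noise into a clip at a target signal-to-noise ratio before re-running inference. The helper below is illustrative only (the function name and SNR levels are not from this repo):

```python
import torch

def add_noise_at_snr(wav, snr_db):
    """Mix white Gaussian noise into a waveform at a target SNR in dB (illustrative)."""
    noise = torch.randn_like(wav)
    # Scale the noise so that signal_power / noise_power == 10 ** (snr_db / 10)
    scale = torch.sqrt(wav.pow(2).mean() / (noise.pow(2).mean() * 10 ** (snr_db / 10)))
    return wav + scale * noise

wav = torch.randn(16000)                 # 1 s dummy clip at 16 kHz
noisy = add_noise_at_snr(wav, snr_db=10.0)
```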
(See the associated paper / report for full numbers and plots.)
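Group F1, as used above, can be computed by slicing predictions per attribute value and reporting the worst-case gap. This sketch uses scikit-learn's `f1_score`; the helper name and return format are assumptions, not the repo's evaluation code.

```python
from sklearn.metrics import f1_score

def group_f1(y_true, y_pred, groups):
    """Macro F1 within each group plus the max-min gap, a simple fairness probe.

    `groups` holds one attribute value per clip (e.g. language "hi"/"en", or gender).
    Illustrative helper, not the repo's actual evaluation code.
    """
    per_group = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        per_group[g] = f1_score([y_true[i] for i in idx],
                                [y_pred[i] for i in idx], average="macro")
    gap = max(per_group.values()) - min(per_group.values())
    return per_group, gap

# Perfect predictions on a toy 4-class example -> F1 of 1.0 in both groups, zero gap
scores, gap = group_f1([0, 1, 2, 3], [0, 1, 2, 3], ["hi", "hi", "en", "en"])
```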
Model tree for Saumya3007/spee_project_fairhindiser-clues
Base model: facebook/wav2vec2-base