FairHindiSER: Fair and Robust Speech Emotion Recognition

This repository contains the checkpoints for FairHindiSER, a speech emotion recognition model trained on a 50/50 mix of Hindi (IITKGP) and English (IEMOCAP) emotional speech. The model predicts four emotions: angry, happy, neutral, sad.

The backbone is facebook/wav2vec2-base, adapted with:

  • A FairSER MLP head on top of pooled wav2vec2 features.
  • LoRA adapters on the top transformer layers.
  • CLUES-style contrastive debiasing to reduce gaps across language/gender groups.
  • Gradual full unfreezing of the encoder with Optuna-tuned learning rates.
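The exact head architecture is defined in the training code; as a rough illustration only (hypothetical hidden size and dropout, not the released configuration), a pooled-feature MLP head over wav2vec2 outputs might look like:

```python
import torch
import torch.nn as nn

class FairSERHead(nn.Module):
    """Illustrative MLP classification head over pooled wav2vec2 features.

    Layer sizes here are placeholders, not the released configuration.
    """
    def __init__(self, in_dim: int = 768, hidden: int = 256, n_classes: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, in_dim) frame-level encoder outputs
        pooled = features.mean(dim=1)  # mean-pool over time
        return self.net(pooled)        # (batch, n_classes) logits

head = FairSERHead()
logits = head(torch.randn(2, 50, 768))  # logits for a batch of 2 clips
```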

Checkpoints in this repo

  • checkpoints/head_best.pt – head-only fine-tuning (backbone frozen).
  • checkpoints/lora_best.pt – LoRA fine-tuning with focal loss and class weights.
  • checkpoints/clues_lora_best.pt – LoRA + CLUES contrastive debiasing.
  • checkpoints/full_best.pt – final model with gradual full unfreezing.

You can load the final model like this:

import torch
from transformers import Wav2Vec2Model
from fairhindiser import FairSERModel  # your model class, if you publish it as a pip package

# Load the checkpoint on CPU; move the model to GPU afterwards if needed.
ckpt = torch.load("checkpoints/full_best.pt", map_location="cpu")
model = FairSERModel()
model.load_state_dict(ckpt)
model.eval()  # disable dropout for inference

(Adapt the import path to your own project structure.)
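After a forward pass, the model's logits can be mapped back to the four emotion labels. A minimal sketch (the label order is an assumption and should be checked against your training configuration):

```python
import torch

# Assumed alphabetical label order; verify against the training label mapping.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def decode_prediction(logits: torch.Tensor):
    """Return (label, confidence) for a single example's logits of shape (4,)."""
    probs = torch.softmax(logits, dim=-1)
    idx = int(probs.argmax())
    return EMOTIONS[idx], float(probs[idx])

label, confidence = decode_prediction(torch.tensor([0.1, 2.0, 0.3, -1.0]))
```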

Intended use

  • Research on cross-lingual and fair speech emotion recognition.
  • Analysis of robustness and calibration under common audio corruptions (noise, speed, pitch perturbations).
  • As a backbone for downstream SER systems in multilingual / accented settings.

Not intended for high-stakes decisions or medical/psychological diagnosis.

Training data

  • Hindi: IITKGP Hindi corpus (acted emotional dialogues).
  • English: IEMOCAP 4-class subset (angry, happy/excited, neutral, sad).

All audio was resampled to 16 kHz mono and normalized. The combined dataset contains 3200 Hindi + 3200 English clips, stratified into train/val/test splits.
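The mono downmix and normalization steps can be sketched as follows (peak normalization is an assumption; resampling to 16 kHz would be done separately, e.g. with `torchaudio.functional.resample`):

```python
import torch

def to_mono_normalized(waveform: torch.Tensor) -> torch.Tensor:
    """Downmix (channels, samples) audio to mono and peak-normalize to [-1, 1].

    Resampling to 16 kHz is handled separately, e.g. via
    torchaudio.functional.resample(waveform, orig_sr, 16000).
    """
    mono = waveform.mean(dim=0) if waveform.dim() == 2 else waveform
    peak = mono.abs().max()
    return mono / peak if peak > 0 else mono

# One second of synthetic stereo audio at 16 kHz
t = torch.linspace(0, 6.28, 16000)
stereo = torch.stack([torch.sin(t), 0.5 * torch.sin(t)])
mono = to_mono_normalized(stereo)
```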

Evaluation

We report metrics on the held-out test set, broken down along AudioTrust-style axes:

  • Per-class F1 and confusion matrix.
  • Group F1 by language, gender and accent.
  • Robustness under additive noise, speed and pitch perturbations.
  • Calibration & privacy proxies from confidence distributions.
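Group F1 and the gap between best- and worst-scoring groups (a simple fairness proxy) can be computed as in this sketch; the group labels shown are hypothetical:

```python
import numpy as np
from sklearn.metrics import f1_score

def group_f1_gap(y_true, y_pred, groups):
    """Macro-F1 per group plus the max pairwise gap across groups."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    scores = {
        g: f1_score(y_true[groups == g], y_pred[groups == g], average="macro")
        for g in np.unique(groups)
    }
    gap = max(scores.values()) - min(scores.values())
    return scores, gap

# Toy example with hypothetical language groups (0=angry, 1=happy, 2=neutral, 3=sad)
scores, gap = group_f1_gap(
    y_true=[0, 1, 2, 3, 0, 1, 2, 3],
    y_pred=[0, 1, 2, 3, 0, 1, 1, 3],
    groups=["hi", "hi", "hi", "hi", "en", "en", "en", "en"],
)
```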

(See the associated paper / report for full numbers and plots.)
