FairHindiSER: Fair and Robust Speech Emotion Recognition

This repository contains the checkpoints for FairHindiSER, a speech emotion recognition model trained on a 50/50 mix of Hindi (IITKGP) and English (IEMOCAP) emotional speech. The model predicts four emotions: angry, happy, neutral, sad.

The backbone is facebook/wav2vec2-base, adapted with:

  • A FairSER MLP head on top of pooled wav2vec2 features.
  • LoRA adapters on the top transformer layers.
  • CLUES-style contrastive debiasing to reduce gaps across language/gender groups.
  • Gradual full unfreezing of the encoder with Optuna-tuned learning rates.
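The exact head architecture is defined in the training code; as a rough illustration only (hypothetical hidden size and dropout, not the released configuration), a pooled-feature MLP head over wav2vec2 outputs might look like:

```python
import torch
import torch.nn as nn

class FairSERHead(nn.Module):
    """Illustrative MLP classification head over pooled wav2vec2 features.

    Layer sizes here are placeholders, not the released configuration.
    """
    def __init__(self, in_dim: int = 768, hidden: int = 256, n_classes: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, in_dim) frame-level encoder outputs
        pooled = features.mean(dim=1)  # mean-pool over time
        return self.net(pooled)        # (batch, n_classes) logits

head = FairSERHead()
logits = head(torch.randn(2, 50, 768))  # logits for a batch of 2 clips
```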

Checkpoints in this repo

  • checkpoints/head_best.pt – head-only fine-tuning (backbone frozen).
  • checkpoints/lora_best.pt – LoRA fine-tuning with focal loss and class weights.
  • checkpoints/clues_lora_best.pt – LoRA + CLUES contrastive debiasing.
  • checkpoints/full_best.pt – final model with gradual full unfreezing.

You can load the final model like this:

import torch
from transformers import Wav2Vec2Model
from fairhindiser import FairSERModel  # your model class, if you publish it as a pip package

# Load the checkpoint on CPU; move the model to GPU afterwards if needed.
ckpt = torch.load("checkpoints/full_best.pt", map_location="cpu")
model = FairSERModel()
model.load_state_dict(ckpt)
model.eval()  # disable dropout for inference

(Adapt the import path to your own project structure.)
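After a forward pass, the model's logits can be mapped back to the four emotion labels. A minimal sketch (the label order is an assumption and should be checked against your training configuration):

```python
import torch

# Assumed alphabetical label order; verify against the training label mapping.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def decode_prediction(logits: torch.Tensor):
    """Return (label, confidence) for a single example's logits of shape (4,)."""
    probs = torch.softmax(logits, dim=-1)
    idx = int(probs.argmax())
    return EMOTIONS[idx], float(probs[idx])

label, confidence = decode_prediction(torch.tensor([0.1, 2.0, 0.3, -1.0]))
```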

Intended use

  • Research on cross-lingual and fair speech emotion recognition.
  • Analysis of robustness and calibration under common audio corruptions (noise, speed, pitch perturbations).
  • As a backbone for downstream SER systems in multilingual / accented settings.

Not intended for high-stakes decisions or medical/psychological diagnosis.

Training data

  • Hindi: IITKGP Hindi corpus (acted emotional dialogues).
  • English: IEMOCAP 4-class subset (angry, happy/excited, neutral, sad).

All audio was resampled to 16 kHz mono and normalized. The combined dataset contains 3200 Hindi + 3200 English clips, stratified into train/val/test splits.
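The mono downmix and normalization steps can be sketched as follows (peak normalization is an assumption; resampling to 16 kHz would be done separately, e.g. with `torchaudio.functional.resample`):

```python
import torch

def to_mono_normalized(waveform: torch.Tensor) -> torch.Tensor:
    """Downmix (channels, samples) audio to mono and peak-normalize to [-1, 1].

    Resampling to 16 kHz is handled separately, e.g. via
    torchaudio.functional.resample(waveform, orig_sr, 16000).
    """
    mono = waveform.mean(dim=0) if waveform.dim() == 2 else waveform
    peak = mono.abs().max()
    return mono / peak if peak > 0 else mono

# One second of synthetic stereo audio at 16 kHz
t = torch.linspace(0, 6.28, 16000)
stereo = torch.stack([torch.sin(t), 0.5 * torch.sin(t)])
mono = to_mono_normalized(stereo)
```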

Evaluation

We report metrics on the held-out test set, broken down along AudioTrust-style axes:

  • Per-class F1 and confusion matrix.
  • Group F1 by language, gender and accent.
  • Robustness under additive noise, speed and pitch perturbations.
  • Calibration & privacy proxies from confidence distributions.
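Group F1 and the gap between best- and worst-scoring groups (a simple fairness proxy) can be computed as in this sketch; the group labels shown are hypothetical:

```python
import numpy as np
from sklearn.metrics import f1_score

def group_f1_gap(y_true, y_pred, groups):
    """Macro-F1 per group plus the max pairwise gap across groups."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    scores = {
        g: f1_score(y_true[groups == g], y_pred[groups == g], average="macro")
        for g in np.unique(groups)
    }
    gap = max(scores.values()) - min(scores.values())
    return scores, gap

# Toy example with hypothetical language groups (0=angry, 1=happy, 2=neutral, 3=sad)
scores, gap = group_f1_gap(
    y_true=[0, 1, 2, 3, 0, 1, 2, 3],
    y_pred=[0, 1, 2, 3, 0, 1, 1, 3],
    groups=["hi", "hi", "hi", "hi", "en", "en", "en", "en"],
)
```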

(See the associated paper / report for full numbers and plots.)
