AVERFormer (RAVDESS)

Multimodal Audio-Visual Emotion Recognition transformer trained on RAVDESS.

Source checkpoint: best_model_epoch_161.pth
Classes: ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']

Inputs

Modality         Shape                  Notes
audio_waveform   [B, 2, 88000]          stereo, 22000 Hz, 4 s
mel_spectrogram  [B, 320, 343]          160 mels per channel, stereo concatenated
mfcc             [B, 40, 343]           20 MFCCs per channel, stereo concatenated
video            [B, 3, 12, 180, 320]   12 uniformly sampled frames, ImageNet-normalised
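
Because the model expects four modalities with fixed shapes, it can help to check a batch before calling the model. The following is a minimal sketch, not part of the AVERFormer repository: the helper name validate_inputs and the dict-of-arrays layout are assumptions made for illustration, and B is taken to be the batch size. It works with anything that exposes a .shape attribute, such as torch tensors or NumPy arrays.

```python
# Expected per-sample shapes from the Inputs table (batch dimension omitted).
EXPECTED_SHAPES = {
    "audio_waveform": (2, 88000),    # stereo, 22000 Hz * 4 s = 88000 samples
    "mel_spectrogram": (320, 343),   # 2 channels * 160 mels, 343 frames
    "mfcc": (40, 343),               # 2 channels * 20 MFCCs, 343 frames
    "video": (3, 12, 180, 320),      # 3 colour channels, 12 frames
}


def validate_inputs(batch):
    """Check a dict of array-likes against EXPECTED_SHAPES; return the batch size.

    Hypothetical helper: `batch` maps modality names to objects with a
    `.shape` attribute. Raises ValueError on the first mismatch found.
    """
    batch_size = None
    for name, expected in EXPECTED_SHAPES.items():
        if name not in batch:
            raise ValueError(f"missing modality: {name}")
        shape = tuple(batch[name].shape)
        # Everything after the leading batch dimension must match exactly.
        if shape[1:] != expected:
            raise ValueError(
                f"{name}: expected [B, {', '.join(map(str, expected))}], "
                f"got {list(shape)}"
            )
        # All modalities must agree on the batch size.
        if batch_size is None:
            batch_size = shape[0]
        elif shape[0] != batch_size:
            raise ValueError(
                f"{name}: batch size {shape[0]} does not match {batch_size}"
            )
    return batch_size
```

Calling validate_inputs on a dict with the four keys above returns the shared batch size, or raises ValueError naming the offending modality.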

See the repository at https://github.com/anthropics/AVERFormer (or your fork) for the matching model code in models/averformer.py.

Loading

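import torch
from huggingface_hub import hf_hub_download
from models.averformer import AVERFormer  # model code from the repository

# Download the checkpoint from the Hugging Face Hub (cached after the first call).
ckpt_path = hf_hub_download(repo_id="mhussainahmad/averformer-ravdess", filename="pytorch_model.pth")
model = AVERFormer(num_classes=8)
# weights_only=False unpickles arbitrary Python objects; use it only with checkpoints you trust.
state = torch.load(ckpt_path, map_location="cpu", weights_only=False)
# Accept either a full training checkpoint or a bare state dict.
# strict=False silently skips missing or unexpected keys.
model.load_state_dict(state.get("model_state_dict", state), strict=False)
model.eval()  # inference mode (affects dropout and batch-norm layers, if any)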
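
Once the model is loaded, its output still has to be mapped back to an emotion label. The sketch below assumes, without confirmation from the model code, that the model returns one row of 8 logits per sample and that the output indices follow the order of the Classes list above. The helper name decode_logits is hypothetical. It takes a plain Python list so that it does not depend on the forward signature; with a torch tensor you could pass logits[0].tolist().

```python
import math

# Class order copied from the model card; assumed to match the output indices.
CLASSES = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']


def decode_logits(logits):
    """Turn one sample's logits into (label, probability) using softmax."""
    if len(logits) != len(CLASSES):
        raise ValueError(f"expected {len(CLASSES)} logits, got {len(logits)}")
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the most probable class (first one wins on ties).
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]


label, prob = decode_logits([0.1, 0.2, 3.0, 0.0, -1.0, 0.5, 0.3, 0.2])
print(label)  # happy
```

If the checkpoint was trained with a different label order, only the CLASSES list needs to change.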