AVERFormer (RAVDESS)

Multimodal Audio-Visual Emotion Recognition transformer trained on RAVDESS.

Source checkpoint: best_model_epoch_161.pth

Classes: ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']

Inputs

Modality	Shape	Notes
`audio_waveform`	`[B, 2, 88000]`	stereo, 22000 Hz, 4 s
`mel_spectrogram`	`[B, 320, 343]`	160 mels per channel, stereo concatenated
`mfcc`	`[B, 40, 343]`	20 MFCC per channel, stereo concatenated
`video`	`[B, 3, 12, 180, 320]`	12 uniformly sampled frames, ImageNet-normalised
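
The shapes in the table can be checked before a forward pass. The sketch below is a hypothetical helper (`EXPECTED_SHAPES` and `check_batch_shapes` are illustrative names, not part of the repository) that compares each input's per-sample dimensions against the table and confirms that the batch size agrees across inputs. Reading the last two video dimensions as height and width is an assumption.

```python
# Hypothetical helper, not part of the AVERFormer repo: validates input shapes
# against the input table before calling the model.

SAMPLE_RATE_HZ = 22000
CLIP_SECONDS = 4
NUM_SAMPLES = SAMPLE_RATE_HZ * CLIP_SECONDS  # 88000 samples per channel

# Per-sample shapes (batch dimension omitted), taken from the input table.
EXPECTED_SHAPES = {
    "audio_waveform": (2, NUM_SAMPLES),  # stereo waveform
    "mel_spectrogram": (2 * 160, 343),   # 160 mels per channel, concatenated
    "mfcc": (2 * 20, 343),               # 20 MFCCs per channel, concatenated
    "video": (3, 12, 180, 320),          # RGB, 12 frames, 180 x 320 (assumed H x W)
}


def check_batch_shapes(shapes):
    """Return the shared batch size, or raise ValueError on any mismatch.

    `shapes` maps input name -> full shape tuple including the batch dimension.
    """
    batch_sizes = set()
    for name, expected in EXPECTED_SHAPES.items():
        if name not in shapes:
            raise ValueError(f"missing input: {name}")
        actual = tuple(shapes[name])
        if actual[1:] != expected:
            raise ValueError(
                f"{name}: expected [B, {', '.join(map(str, expected))}], "
                f"got {list(actual)}"
            )
        batch_sizes.add(actual[0])
    if len(batch_sizes) != 1:
        raise ValueError(f"inconsistent batch sizes: {sorted(batch_sizes)}")
    return batch_sizes.pop()
```

With PyTorch tensors in a dict named `batch`, a call could look like `check_batch_shapes({name: tuple(t.shape) for name, t in batch.items()})`.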

See the repository at https://github.com/anthropics/AVERFormer (or your fork) for the matching model code in models/averformer.py.

Loading

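import torch
from huggingface_hub import hf_hub_download
from models.averformer import AVERFormer  # model code from the repository

# Download the checkpoint from the Hugging Face Hub (cached after the first call).
ckpt_path = hf_hub_download(repo_id="mhussainahmad/averformer-ravdess", filename="pytorch_model.pth")
model = AVERFormer(num_classes=8)
# weights_only=False unpickles arbitrary objects, so only load checkpoints you trust.
state = torch.load(ckpt_path, map_location="cpu", weights_only=False)
# Use the "model_state_dict" entry if the checkpoint has one, otherwise the dict itself.
# strict=False ignores missing or unexpected keys instead of raising an error.
model.load_state_dict(state.get("model_state_dict", state), strict=False)
model.eval()  # inference mode (affects dropout and batch-norm layers, if any)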
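
After loading, the model's output still has to be mapped to an emotion label. The sketch below assumes the model returns one row of 8 logits per sample and that the output index order matches the class list given in this card; neither is confirmed here, so check both against models/averformer.py. `CLASSES`, `softmax`, and `predict_label` are illustrative names, and plain Python lists stand in for tensors.

```python
import math

# Class order copied from this card; assumed to match the model's output indices.
CLASSES = ["neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return (label, probability) for one sample's logits."""
    if len(logits) != len(CLASSES):
        raise ValueError(f"expected {len(CLASSES)} logits, got {len(logits)}")
    probs = softmax(logits)
    # Index of the highest probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]
```

For a logits tensor of shape `[B, 8]`, one sample could be passed as `predict_label(logits[0].tolist())`.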
