AVERFormer (RAVDESS)
Multimodal Audio-Visual Emotion Recognition transformer trained on RAVDESS.
Source checkpoint: best_model_epoch_161.pth
Classes: ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']
Inputs
| Modality | Shape | Notes |
|---|---|---|
| audio_waveform | [B, 2, 88000] | stereo, 22000 Hz, 4 s |
| mel_spectrogram | [B, 320, 343] | 160 mels per channel, stereo concatenated |
| mfcc | [B, 40, 343] | 20 MFCC per channel, stereo concatenated |
| video | [B, 3, 12, 180, 320] | 12 uniformly sampled frames, ImageNet-normalised |
See the repository at https://github.com/anthropics/AVERFormer (or your fork) for the matching model code in models/averformer.py.
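The shapes listed under Inputs follow from a few constants (sample rate times clip length, channels times features per channel). The sketch below derives them and checks a batch against them before it reaches the model. It is only an illustration: the helper names `expected_shapes` and `check_shapes` are hypothetical and not part of the repo, the 343 time frames are copied from the table rather than derived, and reading the last two video dimensions as height 180 and width 320 is an assumption.

```python
# Constants copied from the Inputs table. Helper names are hypothetical.
SAMPLE_RATE_HZ = 22000
CLIP_SECONDS = 4
CHANNELS = 2                 # stereo
MELS_PER_CHANNEL = 160
MFCC_PER_CHANNEL = 20
TIME_FRAMES = 343            # taken from the table as-is, not derived here
NUM_VIDEO_FRAMES = 12
FRAME_HEIGHT, FRAME_WIDTH = 180, 320  # assumption: [.., H, W] ordering


def expected_shapes(batch_size):
    """Return the expected input shape for each modality."""
    return {
        "audio_waveform": (batch_size, CHANNELS, SAMPLE_RATE_HZ * CLIP_SECONDS),
        "mel_spectrogram": (batch_size, CHANNELS * MELS_PER_CHANNEL, TIME_FRAMES),
        "mfcc": (batch_size, CHANNELS * MFCC_PER_CHANNEL, TIME_FRAMES),
        "video": (batch_size, 3, NUM_VIDEO_FRAMES, FRAME_HEIGHT, FRAME_WIDTH),
    }


def check_shapes(shapes, batch_size):
    """Raise ValueError if any modality is missing or has the wrong shape.

    `shapes` maps a modality name to a shape tuple, e.g. tuple(tensor.shape).
    """
    for name, want in expected_shapes(batch_size).items():
        got = tuple(shapes.get(name, ()))
        if got != want:
            raise ValueError(f"{name}: expected {want}, got {got}")


if __name__ == "__main__":
    for name, shape in expected_shapes(batch_size=1).items():
        print(f"{name}: {shape}")
```

With PyTorch tensors, a call such as `check_shapes({k: tuple(v.shape) for k, v in batch.items()}, batch_size)` would catch a mono waveform or a wrong frame count early.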
Loading
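```python
import torch
from huggingface_hub import hf_hub_download

from models.averformer import AVERFormer  # model code from the repo above

# Download the checkpoint from the Hugging Face Hub (cached after the first call).
ckpt_path = hf_hub_download(repo_id="mhussainahmad/averformer-ravdess", filename="pytorch_model.pth")

model = AVERFormer(num_classes=8)
# weights_only=False unpickles arbitrary objects; only use it with checkpoints you trust.
state = torch.load(ckpt_path, map_location="cpu", weights_only=False)
# Use the "model_state_dict" entry if the checkpoint has one, else treat it as the state dict.
# strict=False ignores missing or unexpected keys instead of raising.
model.load_state_dict(state.get("model_state_dict", state), strict=False)
model.eval()
```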
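Once the model is loaded, its output still has to be mapped back to an emotion label. The forward signature of AVERFormer is not documented here, so the sketch below starts from the logits and assumes the model returns one score per class in the same order as the Classes list above. The `softmax` and `decode` helpers are hypothetical and written in plain Python so they do not depend on any particular tensor library.

```python
import math

# Class order copied from the Classes line of this card.
CLASSES = ['neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']


def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, probability) for one sample's logits.

    Assumes len(logits) == 8 and that index i corresponds to CLASSES[i].
    """
    if len(logits) != len(CLASSES):
        raise ValueError(f"expected {len(CLASSES)} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]


if __name__ == "__main__":
    # Made-up logits for illustration only, not real model output.
    label, prob = decode([0.1, 0.0, 2.5, -1.0, 0.3, 0.0, -0.5, 0.2])
    print(label, round(prob, 3))
```

If the model returns a PyTorch tensor of shape [B, 8], each row can be passed through as `decode(logits[i].tolist())`.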