AVERFormer-v4 (RAVDESS)
Multimodal Audio-Visual-Text Emotion Recognition transformer trained on RAVDESS. Code: https://github.com/mhussainahmad/AVERFormer
Reported numbers
- Best single-seed validation weighted F1 (wF1): 0.6591 (seed 42, epoch 19)
- Best ensemble wF1: 0.7887 (top-3 ensemble)
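The card does not say how wF1 was computed or how the top-3 ensemble combines its members. As a rough illustration only, the sketch below computes a support-weighted F1 and assumes the ensemble averages per-class probabilities before taking the argmax. The function names `weighted_f1` and `ensemble_predict` are hypothetical and do not come from the repo.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its share of true labels
    return score

def ensemble_predict(prob_sets):
    """Average class probabilities across models (an assumption), then argmax."""
    n_models = len(prob_sets)
    preds = []
    for per_sample in zip(*prob_sets):  # one tuple of model outputs per sample
        mean = [sum(col) / n_models for col in zip(*per_sample)]
        preds.append(max(range(len(mean)), key=mean.__getitem__))
    return preds
```

Here `prob_sets` is a list with one entry per model, each entry holding one probability vector per sample.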
Classes (7)
['neutral', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']
Protocol: 7-class RAVDESS (calm dropped, consistent with cross-corpus AVR convention).
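For reference, a minimal sketch of how the 7-class protocol could be applied to raw RAVDESS files. RAVDESS encodes the emotion as the third hyphen-separated field of each file name, with `02` meaning calm. The helper `label_from_filename` is hypothetical, and the sketch assumes class indices follow the order of the list above, which the card does not confirm.

```python
from pathlib import Path

CLASSES = ['neutral', 'happy', 'sad', 'angry', 'fearful', 'disgust', 'surprised']

# RAVDESS emotion codes (third field of the file name); "02" is calm.
RAVDESS_CODES = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def label_from_filename(path):
    """Return the 7-class index for a RAVDESS file, or None for calm clips."""
    fields = Path(path).stem.split("-")
    emotion = RAVDESS_CODES[fields[2]]
    if emotion == "calm":
        return None  # dropped under the 7-class protocol
    return CLASSES.index(emotion)
```

Clips for which the helper returns `None` would be filtered out before training or evaluation.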
Architecture
- Audio: microsoft/wavlm-large (16 kHz mono waveform)
- Video: MCG-NJU/videomae-large (16 frames @ 224x224 RGB)
- Text: microsoft/deberta-v3-large (speaker-aware ctx encoder)
- Fusion: 2-layer cross-modal transformer, dim=512, 8 heads
- Heads: face / voice / text / joint (all share class count)
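To make the fusion stage concrete, here is a hedged PyTorch sketch, not the repo's implementation. It assumes each backbone emits 1024-dimensional token features, that "cross-modal" means self-attention over the concatenated modality tokens, and that each head mean-pools its input. The class name `FusionSketch` and all of its parameters are hypothetical.

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Hypothetical stand-in for the fusion stage; not the repo's code."""

    def __init__(self, audio_dim=1024, video_dim=1024, text_dim=1024,
                 dim=512, heads=8, layers=2, num_classes=7):
        super().__init__()
        # Project each backbone's features into the shared fusion width.
        self.proj = nn.ModuleDict({
            "audio": nn.Linear(audio_dim, dim),
            "video": nn.Linear(video_dim, dim),
            "text": nn.Linear(text_dim, dim),
        })
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=layers)
        # One classifier per head named on the card; all share num_classes.
        self.heads = nn.ModuleDict({
            name: nn.Linear(dim, num_classes)
            for name in ("face", "voice", "text", "joint")
        })

    def forward(self, audio, video, text):
        a = self.proj["audio"](audio)
        v = self.proj["video"](video)
        t = self.proj["text"](text)
        # Attend across the tokens of all three modalities at once.
        fused = self.fusion(torch.cat([a, v, t], dim=1))
        return {
            "voice": self.heads["voice"](a.mean(dim=1)),
            "face": self.heads["face"](v.mean(dim=1)),
            "text": self.heads["text"](t.mean(dim=1)),
            "joint": self.heads["joint"](fused.mean(dim=1)),
        }
```

Each input is a `(batch, tokens, features)` tensor, and every head returns logits of shape `(batch, num_classes)`.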
Loading
import json

import torch
from huggingface_hub import hf_hub_download
from models.averformer_v4 import AVERFormerV4  # from the GitHub repo

REPO_ID = "mhussainahmad/averformer-ravdess-v4"

# Fetch the config and checkpoint from the Hub (cached after the first call).
with open(hf_hub_download(repo_id=REPO_ID, filename="config.json")) as f:
    cfg = json.load(f)
ckpt = hf_hub_download(repo_id=REPO_ID, filename="pytorch_model.pth")

model = AVERFormerV4(
    audio_backbone=cfg["audio_backbone"],
    video_backbone=cfg["video_backbone"],
    text_backbone=cfg["text_backbone"],
    num_classes=cfg["num_classes"],
    fusion_layers=cfg["fusion_layers"],
    lora_r=cfg["lora_r"],
    use_text=True,
)

# weights_only=False unpickles arbitrary objects; only load checkpoints you trust.
state = torch.load(ckpt, map_location="cpu", weights_only=False)
# strict=False silently skips keys that do not match the model.
model.load_state_dict(state["model"], strict=False)
model.eval()
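Because the checkpoint is loaded with `strict=False`, mismatched keys are skipped without raising an error. `load_state_dict` returns an object with `missing_keys` and `unexpected_keys`, so one way to check the load is sketched below. The helper `report_key_mismatches` is hypothetical, and the toy model merely stands in for `AVERFormerV4`.

```python
import torch.nn as nn

def report_key_mismatches(model, state_dict):
    """Load non-strictly and return (missing, unexpected) key lists for inspection."""
    result = model.load_state_dict(state_dict, strict=False)
    return list(result.missing_keys), list(result.unexpected_keys)

# Toy demonstration with a hypothetical two-layer model, not AVERFormerV4.
toy = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
partial = {k: v for k, v in toy.state_dict().items() if k.startswith("0.")}
partial["extra.weight"] = toy.state_dict()["0.weight"]  # key the model does not own
missing, unexpected = report_key_mismatches(toy, partial)
print(missing)
print(unexpected)
```

With the real model, the equivalent call would be `report_key_mismatches(model, state["model"])`. A long list of missing keys would suggest the constructor arguments do not match the checkpoint.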
Live inference
python live_emotion_v4.py --repo_id mhussainahmad/averformer-ravdess-v4
See LIVE_INFERENCE_README.md in the GitHub repo for full setup.