Rakancorle1/hans-10k
Data and models for "When Vision Speaks for Sound". Includes SFT and DPO training data, evaluation data, and trained checkpoints.
Note 10K-sample DPO preference data for curing the audio-visual Clever Hans effect.
Note SFT data for the video-audio alignment task.
Note In-domain Thud benchmark (sync / mute / swap).
Note Out-of-domain audio-visual sync benchmark on VGGSound.