AudioX Models (Safetensors)
.safetensors conversions of AudioX-MAF model checkpoints for use with ComfyUI-FFMPEGA.
AudioX is a unified anything-to-audio model, published at ICLR 2026, that supports text-to-audio, text-to-music, video-to-audio, and audio inpainting.
Models
| File | Description | Size |
|---|---|---|
| model.safetensors | AudioX-MAF DiT model (full precision) | 5.19 GB |
| synchformer_state_dict.safetensors | Synchformer temporal encoder (shared with MMAudio) | 475 MB |
| config.json | Model architecture configuration | 3.3 KB |
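To sanity-check a downloaded checkpoint without loading several gigabytes into memory, the sketch below reads only the JSON header of a .safetensors file using the standard library. It relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header). The function names `read_safetensors_header` and `summarize` are hypothetical helpers for illustration, not part of ComfyUI-FFMPEGA.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path: Path) -> dict:
    """Read only the JSON header of a .safetensors file, not the tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def summarize(path: Path) -> dict:
    """Map tensor name -> (dtype, shape), skipping the optional __metadata__ entry."""
    header = read_safetensors_header(path)
    return {
        name: (info["dtype"], info["shape"])
        for name, info in header.items()
        if name != "__metadata__"
    }
```

For example, `summarize(Path("ComfyUI/models/audiox/model.safetensors"))` would list the tensor names with their dtypes and shapes, which is usually enough to confirm that a download is complete and is the expected file.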
Sources
All models were downloaded from their original sources and converted by us:
- AudioX-MAF: HKUSTAudio/AudioX-MAF
- Synchformer: Shared with MMAudio
Usage
These models are automatically downloaded by the generate_music and audio_inpaint skills in ComfyUI-FFMPEGA. No manual setup needed.
Manual installation:
ComfyUI/models/audiox/
├── model.safetensors
├── synchformer_state_dict.safetensors
└── config.json
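For a manual install, a minimal sketch along these lines could check which of the three files are missing and fetch them with `huggingface_hub`. It assumes the repository id `AEmotionStudio/audiox-models` (taken from this model card's Hub page) and that the files sit at the repository root. `missing_files` and `download_missing` are hypothetical helper names, not functions provided by ComfyUI-FFMPEGA.

```python
from pathlib import Path

# Assumption: repository id as shown on this model card's Hub page.
REPO_ID = "AEmotionStudio/audiox-models"
# File names as listed in the Models table.
EXPECTED_FILES = (
    "model.safetensors",
    "synchformer_state_dict.safetensors",
    "config.json",
)


def missing_files(models_dir: Path) -> list[str]:
    """Return the expected file names that are not present in models_dir."""
    return [name for name in EXPECTED_FILES if not (models_dir / name).is_file()]


def download_missing(models_dir: Path) -> None:
    """Fetch any missing files from the Hub (needs `pip install huggingface_hub`)."""
    # Imported lazily so missing_files() works without the dependency installed.
    from huggingface_hub import hf_hub_download

    models_dir.mkdir(parents=True, exist_ok=True)
    for name in missing_files(models_dir):
        # Assumption: each file is stored at the root of the repository.
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=models_dir)
```

Calling `download_missing(Path("ComfyUI/models/audiox"))` from the directory that contains ComfyUI should produce the layout shown above. Files that already exist are skipped.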
Note: The synchformer_state_dict.safetensors file is shared with MMAudio. If you already have it in ComfyUI/models/mmaudio/, AudioX will reuse it automatically, so no duplicate download is needed.
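The reuse behavior described in the note can be pictured as a simple path lookup. The sketch below is an illustration only: the search order (AudioX folder first, then MMAudio) and the helper name `find_synchformer` are assumptions, and the actual lookup in ComfyUI-FFMPEGA may differ.

```python
from pathlib import Path
from typing import Optional

SYNCHFORMER = "synchformer_state_dict.safetensors"


def find_synchformer(models_root: Path) -> Optional[Path]:
    """Return the first existing copy of the Synchformer weights, or None."""
    # Assumed search order: the AudioX folder first, then the MMAudio folder.
    for folder in ("audiox", "mmaudio"):
        candidate = models_root / folder / SYNCHFORMER
        if candidate.is_file():
            return candidate
    return None
```

With `models_root = Path("ComfyUI/models")`, a copy that exists only under `mmaudio/` would be returned, which avoids a second 475 MB download.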
Capabilities
| Skill | Description |
|---|---|
| generate_music | Text-to-music and video-to-music generation |
| audio_inpaint | Fill gaps, extend, or regenerate sections of audio |
License
⚠️ CC-BY-NC 4.0 — AudioX model weights are licensed under Creative Commons Attribution-NonCommercial 4.0. Commercial use of the models is restricted. The code that loads/runs them is GPL-3.0.
Paper
AudioX: Diffusion Transformer for Anything-to-Audio Generation (ICLR 2026) arXiv:2503.10522
Model tree for AEmotionStudio/audiox-models
Base model
HKUSTAudio/AudioX