AudioX Models (Safetensors)

.safetensors conversions of AudioX-MAF model checkpoints for use with ComfyUI-FFMPEGA.

AudioX is a unified anything-to-audio model from ICLR 2026 that supports text-to-audio, text-to-music, video-to-audio, and audio inpainting.

Models

| File | Description | Size |
| --- | --- | --- |
| model.safetensors | AudioX-MAF DiT model (full precision) | 5.19 GB |
| synchformer_state_dict.safetensors | Synchformer temporal encoder (shared with MMAudio) | 475 MB |
| config.json | Model architecture configuration | 3.3 KB |
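
To sanity-check a downloaded checkpoint without loading several gigabytes of weights, you can read only the file header. A .safetensors file starts with an 8-byte little-endian header length followed by a JSON header describing each tensor. The sketch below relies only on that layout and the Python standard library; the function names `read_safetensors_header` and `summarize` are illustrative and not part of ComfyUI-FFMPEGA.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned 64-bit little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def summarize(path):
    """Map each tensor name to its (dtype, shape) as recorded in the header."""
    header = read_safetensors_header(path)
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return {name: (info["dtype"], info["shape"]) for name, info in header.items()}
```

For example, `len(summarize("ComfyUI/models/audiox/model.safetensors"))` gives the number of tensors stored in the DiT checkpoint.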

Sources

All models were downloaded from their original sources and converted to .safetensors by us.

Usage

These models are automatically downloaded by the generate_music and audio_inpaint skills in ComfyUI-FFMPEGA. No manual setup needed.

Manual installation: place the three files in the following layout:

ComfyUI/models/audiox/
├── model.safetensors
├── synchformer_state_dict.safetensors
└── config.json
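
If you prefer to script the manual installation, the layout above can be populated with `hf_hub_download` from the `huggingface_hub` package. This is a minimal sketch, not the downloader used by ComfyUI-FFMPEGA: the repository id `AEmotionStudio/audiox-models` is assumed from this model page, and `download_audiox` with its `fetch` parameter is a hypothetical helper introduced here so the download function can be swapped out.

```python
from pathlib import Path

REPO_ID = "AEmotionStudio/audiox-models"  # assumption: the repo hosting these files
FILES = (
    "model.safetensors",
    "synchformer_state_dict.safetensors",
    "config.json",
)


def download_audiox(comfy_root, fetch=None):
    """Download the three files into <comfy_root>/models/audiox/ and return their paths."""
    if fetch is None:
        # Imported lazily so the module loads even without huggingface_hub installed.
        from huggingface_hub import hf_hub_download
        fetch = hf_hub_download
    target = Path(comfy_root) / "models" / "audiox"
    target.mkdir(parents=True, exist_ok=True)
    return [
        Path(fetch(repo_id=REPO_ID, filename=name, local_dir=str(target)))
        for name in FILES
    ]
```

Calling `download_audiox("ComfyUI")` from the directory that contains your ComfyUI checkout should produce the tree shown above.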

Note: synchformer_state_dict.safetensors is shared with MMAudio. If you already have it in ComfyUI/models/mmaudio/, AudioX reuses it automatically, so no duplicate download is needed.
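
The reuse rule in the note can be sketched as a small path resolver that also reports which files are still missing. The lookup order (audiox folder first, then mmaudio) is an assumption made for illustration; the actual logic inside ComfyUI-FFMPEGA may differ, and `resolve_audiox_files` is a hypothetical name.

```python
from pathlib import Path

SYNCHFORMER = "synchformer_state_dict.safetensors"


def resolve_audiox_files(comfy_root):
    """Return (resolved paths by file name, list of file names that are missing)."""
    models = Path(comfy_root) / "models"
    audiox = models / "audiox"
    resolved = {name: audiox / name for name in ("model.safetensors", "config.json")}
    # Assumed order: prefer a copy in audiox/, otherwise fall back to mmaudio/.
    candidates = [audiox / SYNCHFORMER, models / "mmaudio" / SYNCHFORMER]
    resolved[SYNCHFORMER] = next((p for p in candidates if p.is_file()), candidates[0])
    missing = [name for name, path in resolved.items() if not path.is_file()]
    return resolved, missing
```

An empty `missing` list means all three files were found, with the Synchformer weights possibly coming from the MMAudio folder.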

Capabilities

| Skill | Description |
| --- | --- |
| generate_music | Text-to-music and video-to-music generation |
| audio_inpaint | Fill gaps, extend, or regenerate sections of audio |

License

⚠️ CC-BY-NC 4.0: AudioX model weights are licensed under Creative Commons Attribution-NonCommercial 4.0, so commercial use of the models is not permitted. The code that loads and runs them is licensed under GPL-3.0.

Paper

AudioX: Diffusion Transformer for Anything-to-Audio Generation (ICLR 2026) arXiv:2503.10522

Base model

HKUSTAudio/AudioX