AudioX Models (Safetensors)

.safetensors conversions of AudioX-MAF model checkpoints for use with ComfyUI-FFMPEGA.

AudioX is a unified anything-to-audio model, published at ICLR 2026, that supports text-to-audio, text-to-music, video-to-audio, and audio inpainting.

Models

| File | Description | Size |
|------|-------------|------|
| `model.safetensors` | AudioX-MAF DiT model (full precision) | 5.19 GB |
| `synchformer_state_dict.safetensors` | Synchformer temporal encoder (shared with MMAudio) | 475 MB |
| `config.json` | Model architecture configuration | 3.3 KB |
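
To check a downloaded file against the table above without loading 5 GB of weights, the safetensors header can be read directly. The sketch below relies only on the public safetensors file layout (an 8-byte little-endian header length followed by a JSON header). The path in the `__main__` guard is an assumed install location, and the helper names `read_safetensors_header` and `summarize` are illustrative, not part of ComfyUI-FFMPEGA.

```python
import json
import struct
from math import prod
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as an unsigned little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize(path):
    """Count parameters and collect tensor dtypes from the header alone."""
    header = read_safetensors_header(path)
    # "__metadata__" is an optional free-form entry, not a tensor.
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    n_params = sum(prod(t["shape"]) for t in tensors.values())
    dtypes = sorted({t["dtype"] for t in tensors.values()})
    return n_params, dtypes


if __name__ == "__main__":
    # Assumed location; adjust to wherever the file was saved.
    model_path = Path("ComfyUI/models/audiox/model.safetensors")
    if model_path.exists():
        n_params, dtypes = summarize(model_path)
        print(f"{n_params / 1e9:.2f}B parameters, dtypes: {dtypes}")
```

For a full-precision checkpoint, the reported dtypes would be expected to contain only `F32`.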

Sources

All models were downloaded from their original sources and converted by us:

AudioX-MAF: HKUSTAudio/AudioX-MAF
Synchformer: Shared with MMAudio

Usage

These models are downloaded automatically by the `generate_music` and `audio_inpaint` skills in ComfyUI-FFMPEGA, so no manual setup is needed.

For manual installation, place the files in the following layout:

ComfyUI/models/audiox/
├── model.safetensors
├── synchformer_state_dict.safetensors
└── config.json
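
As a rough aid for manual installation, the sketch below checks which of the three files are absent from `models/audiox/` and can optionally fetch them with `huggingface_hub`. The repo id is taken from this model page; `missing_files` and `download_missing` are hypothetical helper names, and the ComfyUI root path is an assumption to adjust for your setup.

```python
from pathlib import Path

# Repo id as shown on this model page; file names come from the layout above.
REPO_ID = "AEmotionStudio/audiox-models"
EXPECTED_FILES = (
    "model.safetensors",
    "synchformer_state_dict.safetensors",
    "config.json",
)


def missing_files(comfy_root):
    """List expected files that are not yet in <comfy_root>/models/audiox/."""
    target = Path(comfy_root) / "models" / "audiox"
    return [name for name in EXPECTED_FILES if not (target / name).is_file()]


def download_missing(comfy_root):
    """Fetch absent files from the Hub (needs `pip install huggingface_hub`)."""
    # Imported lazily so the check above works without the dependency.
    from huggingface_hub import hf_hub_download

    target = Path(comfy_root) / "models" / "audiox"
    target.mkdir(parents=True, exist_ok=True)
    for name in missing_files(comfy_root):
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=target)


if __name__ == "__main__":
    # Assumed ComfyUI root; change to your installation path.
    print("Missing:", missing_files("ComfyUI"))
```

Calling `download_missing("ComfyUI")` would then download only the files reported as missing.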

Note: `synchformer_state_dict.safetensors` is shared with MMAudio. If it is already present in `ComfyUI/models/mmaudio/`, AudioX reuses it automatically, so no duplicate download is needed.
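
The reuse behaviour described in the note can be pictured as a simple fallback lookup. This is a minimal sketch, not the actual ComfyUI-FFMPEGA code: the function name `resolve_synchformer` and the search order (`audiox` first, then `mmaudio`) are assumptions made for illustration.

```python
from pathlib import Path

SYNCHFORMER = "synchformer_state_dict.safetensors"


def resolve_synchformer(models_dir):
    """Return the first existing Synchformer checkpoint, or None if absent.

    Assumed search order: models/audiox/ first, then models/mmaudio/.
    """
    models_dir = Path(models_dir)
    for folder in ("audiox", "mmaudio"):
        candidate = models_dir / folder / SYNCHFORMER
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Assumed models directory; adjust to your installation.
    found = resolve_synchformer("ComfyUI/models")
    print(found if found else "Synchformer checkpoint not found; a download would be needed.")
```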

Capabilities

| Skill | Description |
|-------|-------------|
| `generate_music` | Text-to-music and video-to-music generation |
| `audio_inpaint` | Fill gaps, extend, or regenerate sections of audio |

License

⚠️ CC-BY-NC 4.0: AudioX model weights are licensed under Creative Commons Attribution-NonCommercial 4.0, which does not permit commercial use of the models. The code that loads and runs them is licensed under GPL-3.0.

Paper

AudioX: Diffusion Transformer for Anything-to-Audio Generation (ICLR 2026), arXiv:2503.10522

Safetensors: model size 1B params, tensor type F32

Paper for AEmotionStudio/audiox-models: AudioX: Diffusion Transformer for Anything-to-Audio Generation (arXiv:2503.10522, published Mar 13, 2025)