nikosan2026 and drbaph committed
Commit 7c386ab · 0 Parent(s)

Duplicate from drbaph/Woosh

Co-authored-by: DRBAPH <drbaph@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,36 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ ComfyUI-Woosh-example.mp4 filter=lfs diff=lfs merge=lfs -text
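These attribute rules explain why the weight files in this commit appear as three-line pointers while the `config.yaml` files are shown in full. The sketch below is a rough approximation only: it matches the file's basename against a handful of the slash-free patterns with Python's `fnmatchcase`, which is not a full implementation of gitattributes pattern matching (it ignores `saved_model/**/*`, for instance). The helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A few of the slash-free patterns from the .gitattributes in this commit.
LFS_PATTERNS = [
    "*.safetensors",
    "*.ckpt",
    "*.pt",
    "*.bin",
    "*.tar.*",
    "*tfevents*",
    "ComfyUI-Woosh-example.mp4",
]


def is_lfs_tracked(path: str) -> bool:
    """Approximate check: match the basename against slash-free patterns."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in LFS_PATTERNS)


for candidate in ("Woosh-AE/weights.safetensors", "Woosh-AE/config.yaml", "README.md"):
    print(candidate, is_lfs_tracked(candidate))
```

For an authoritative answer inside a real checkout, `git check-attr filter -- <path>` reports the attribute Git itself resolves.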
ComfyUI-Woosh-example.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fb9d2624d4b8f17def1ce69a821ac966dadeedefdcb4ecbf8ec1fcc5c56b1964
+ size 4379001
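The three added lines are a Git LFS pointer rather than the video itself: the repository records only the spec version, a SHA-256 object ID, and the size in bytes. A minimal sketch of reading such a pointer, assuming the space-separated key/value layout shown in this diff (the helper name `parse_lfs_pointer` is hypothetical):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into a dict (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer contents copied from the diff above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:fb9d2624d4b8f17def1ce69a821ac966dadeedefdcb4ecbf8ec1fcc5c56b1964
size 4379001
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], round(info["size_bytes"] / 1e6, 2), "MB")
```

The same parsing applies to every `weights.safetensors` pointer in this commit, since they share the layout.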
README.md ADDED
@@ -0,0 +1,101 @@
+ ---
+ license: cc-by-nc-4.0
+ language:
+ - en
+ pipeline_tag: text-to-audio
+ tags:
+ - t2a
+ - v2a
+ - text-to-audio
+ - video-to-audio
+ - woosh
+ - comfyui
+ - diffusion
+ - audio
+ - flow-matching
+ ---
+
+ # Woosh — Sound Effect Generative Models
+
+ Inference code and open weights for sound effect generative models developed at Sony AI.
+
+ [![GitHub](https://img.shields.io/badge/GitHub-SonyResearch%2FWoosh-black)](https://github.com/SonyResearch/Woosh)
+ [![ComfyUI
+ Node](https://img.shields.io/badge/ComfyUI-ComfyUI--Woosh-blue)](https://github.com/Saganaki22/ComfyUI-Woosh)
+ [![arXiv](https://img.shields.io/badge/arXiv-2502.07359-b31b1b)](https://arxiv.org/abs/2502.07359)
+
+
+
+ ![Screenshot 2026-04-12 013347](https://cdn-uploads.huggingface.co/production/uploads/63473b59e5c0717e6737b872/kafOo1f9eZYfyyHgcbzPj.png)
+
+
+ <video controls width="100%">
+ <source src="https://huggingface.co/drbaph/Woosh/resolve/main/ComfyUI-Woosh-example.mp4" type="video/mp4">
+ Your browser does not support the video tag.
+ </video>
+
+ ## Models
+
+ | Model | Task | Steps | CFG | Description |
+ |-------|------|-------|-----|-------------|
+ | **Woosh-Flow** | Text-to-Audio | 50 | 4.5 | Base model, best quality |
+ | **Woosh-DFlow** | Text-to-Audio | 4 | 1.0 | Distilled Flow, fast generation |
+ | **Woosh-VFlow** | Video-to-Audio | 50 | 4.5 | Base video-to-audio model |
+ | **Woosh-DVFlow** | Video-to-Audio | 4 | 1.0 | Distilled VFlow, fast video-to-audio |
+
+ ### Components
+
+ - **Woosh-AE** — High-quality latent encoder/decoder. Provides latents for generative modeling and decodes audio from
+ generated latents.
+ - **Woosh-CLAP (TextConditionerA/V)** — Multimodal text-audio alignment model. Provides token latents for diffusion
+ model conditioning. TextConditionerA for T2A, TextConditionerV for V2A.
+ - **Woosh-Flow / Woosh-DFlow** — Original and distilled latent diffusion models (LDMs) for text-to-audio generation.
+ - **Woosh-VFlow / Woosh-DVFlow** — Original and distilled multimodal LDMs generating audio from video with optional text prompts.
+
+ ## ComfyUI Nodes
+
+ Use these models in [ComfyUI](https://github.com/comfyanonymous/ComfyUI) with
+ [ComfyUI-Woosh](https://github.com/Saganaki22/ComfyUI-Woosh):
+
+ ```bash
+ # Via ComfyUI Manager — search "Woosh" and click Install
+ # Or manually:
+ cd ComfyUI/custom_nodes
+ git clone https://github.com/Saganaki22/ComfyUI-Woosh.git
+ pip install -r ComfyUI-Woosh/requirements.txt
+ ```
+
+ Place downloaded model folders in `ComfyUI/models/woosh/`. See the [ComfyUI-Woosh
+ README](https://github.com/Saganaki22/ComfyUI-Woosh) for full setup and workflow examples.
+
+ > **Note:** Set the Woosh TextConditioning node to **T2A** for Flow/DFlow models and **V2A** for VFlow/DVFlow models.
+
+ ## Inference
+
+ See the [official Woosh repository](https://github.com/SonyResearch/Woosh) for standalone inference code and training
+ details.
+
+ ## VRAM Requirements
+
+ | Model | VRAM (Approx) |
+ |-------|---------------|
+ | Flow / VFlow | ~8-12 GB |
+ | DFlow / DVFlow | ~4-6 GB |
+ | With CPU offload | ~2-4 GB |
+
+ ## Citation
+
+ ```bibtex
+ @article{saghibakshi2025woosh,
+ title={Woosh: Enhancing Text-to-Audio Generation with Flow Matching and FlowMap Distillation},
+ author={Saghibakshi, Ali and Bakshi, Soroosh and Tagliasacchi, Antonio and Wang, Shaojie and Choi, Jongmin and
+ Kawakami, Kazuhiro and Gu, Yuxuan},
+ journal={arXiv preprint arXiv:2502.07359},
+ year={2025}
+ }
+ ```
+
+ ## License
+
+ - **Code** — Apache 2.0
+ - **Model Weights** — [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
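The README asks for the model folders to be placed in `ComfyUI/models/woosh/`. One way to fetch only the folders a given model needs is sketched below. It assumes the `huggingface_hub` package, takes the repository ID `drbaph/Woosh` from the video URL in the README, and assumes that each flow model needs its text conditioner and the autoencoder alongside it, which is how the `path:` entries in the `config.yaml` files of this commit read. The helper names are hypothetical.

```python
from pathlib import Path

# Folder names as they appear in this commit; the mapping mirrors the
# conditioner `path:` entries in each model's config.yaml.
TEXT_CONDITIONER = {
    "Woosh-Flow": "TextConditionerA",
    "Woosh-DFlow": "TextConditionerA",
    "Woosh-VFlow-8s": "TextConditionerV",
    "Woosh-DVFlow-8s": "TextConditionerV",
}


def folders_for(model: str) -> list[str]:
    """Folders one flow model needs: itself, its text conditioner, the autoencoder."""
    return [model, TEXT_CONDITIONER[model], "Woosh-AE"]


def download(model: str, comfy_root: str = "ComfyUI") -> Path:
    """Fetch only the needed folders into <comfy_root>/models/woosh/ (needs network)."""
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    target = Path(comfy_root) / "models" / "woosh"
    snapshot_download(
        repo_id="drbaph/Woosh",  # assumption: taken from the README's video URL
        local_dir=target,
        allow_patterns=[f"{name}/*" for name in folders_for(model)],
    )
    return target


print(folders_for("Woosh-DFlow"))
```

Calling `download("Woosh-DFlow")` would then pull roughly 3.7 GB (the three pointer sizes listed in this commit added together); the exact layout the ComfyUI node expects is documented in its own README.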
TextConditionerA/config.yaml ADDED
@@ -0,0 +1,22 @@
+ exclude_from_checkpoint: false
+ trainable: false
+ sentence_config:
+ frozen: false
+ model: roberta-large
+ max_sentence_tokens: 77
+ adopt_n_layers: 0
+ adopt_layer_size: 2048
+ pool_type: eos
+ add_pooling_layer: true
+ hidden_dropout_prob: 0.1
+ attention_probs_dropout_prob: 0.1
+ finetune_n_layers: -1
+ last_hidden_state: true
+ use_shared_space: false
+ normalize_shared_space: true
+ freeze_clap: true
+ lhs_index: -2
+ remove_special_tokens: false
+ eval_mode: true
+ text_preprocessing: no_op
+ shared_representation_size: 1024
TextConditionerA/weights.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9f66e1877efd49e3198068ffac74d00a84df18e53189643373e2e777224856bb
+ size 1425689504
TextConditionerV/config.yaml ADDED
@@ -0,0 +1,22 @@
+ exclude_from_checkpoint: false
+ trainable: false
+ sentence_config:
+ frozen: false
+ model: roberta-large
+ max_sentence_tokens: 77
+ adopt_n_layers: 0
+ adopt_layer_size: 2048
+ pool_type: eos
+ add_pooling_layer: true
+ hidden_dropout_prob: 0.1
+ attention_probs_dropout_prob: 0.1
+ finetune_n_layers: -1
+ last_hidden_state: true
+ use_shared_space: false
+ normalize_shared_space: true
+ freeze_clap: true
+ lhs_index: -2
+ remove_special_tokens: false
+ eval_mode: true
+ text_preprocessing: no_op
+ shared_representation_size: 1024
TextConditionerV/weights.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b0dd912c322582cb95531de69cf1472c90752e468087a3b0fc461dea49975e80
+ size 1425689504
Woosh-AE/config.yaml ADDED
@@ -0,0 +1,27 @@
+ exclude_from_checkpoint: true
+ trainable: true
+ istft_head:
+ _target_: woosh.module.model.vocos.ISTFTCircleHead
+ _partial_: true
+ padding: center
+ conv_pad: same
+ conv_kernel: 1
+ softclip: softplus
+ ztransform:
+ _target_: woosh.module.model.vocos.ZeroDropoutTransform
+ _partial_: true
+ p: 1.0
+ _target_: woosh.module.model.VocosAutoEncoder
+ channels: 1
+ z_dim: 128
+ d_model: 2048
+ intermediate_dim: 3072
+ n_fft: 960
+ hop_length: 480
+ num_layers: 8
+ enc_num_layers: 8
+ input_layer_norm: true
+ final_layer_norm: true
+ stft_normalized: false
+ spec_embed: stft-complex
+ sample_rate: 48000
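The STFT settings in this config fix the time resolution of the autoencoder's spectrogram frames. The arithmetic below is a sketch that assumes one latent vector per STFT frame (no further temporal downsampling) and center-padded framing; neither is stated in this commit. Under those assumptions the frame rate comes out at 100 per second, which agrees with `audio_fps: 100` in the video-to-audio configs, and 5 seconds of audio gives 501 frames, the same number as `max_seq_len: 501` in the flow configs. Whether that second match is by design is not confirmed by the source.

```python
# Values copied from Woosh-AE/config.yaml in this commit.
SAMPLE_RATE = 48_000
N_FFT = 960
HOP_LENGTH = 480

frames_per_second = SAMPLE_RATE / HOP_LENGTH  # STFT frames per second of audio
window_ms = 1000 * N_FFT / SAMPLE_RATE        # analysis window length in ms
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE      # stride between frames in ms
freq_bins = N_FFT // 2 + 1                    # one-sided spectrum size


def centered_frames(seconds: float) -> int:
    """Frame count under center-padded STFT framing: 1 + n_samples // hop."""
    return 1 + int(seconds * SAMPLE_RATE) // HOP_LENGTH


print(frames_per_second, window_ms, hop_ms, freq_bins, centered_frames(5.0))
```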
Woosh-AE/weights.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:48e0d864beee90b17d232c74b16e9878dadc73b4ae55e4f6939df3f559c5d88a
+ size 884664420
Woosh-DFlow/config.yaml ADDED
@@ -0,0 +1,51 @@
+ exclude_from_checkpoint: false
+ trainable: true
+ ldm:
+ exclude_from_checkpoint: false
+ trainable: true
+ model_type: LatentDiffusionModel
+ dit:
+ model_type: mmmssflux
+ rope_len_multiplier: 2
+ max_description_length: 77
+ max_seq_len: 501
+ dim: 1024
+ inter_dim: 4096
+ fixed_timestep_features: false
+ timestep_features_dim: 256
+ n_layers: 12
+ n_heads: 8
+ n_multimodal_layers: 6
+ qk_nope_head_dim: 16
+ qk_rope_head_dim: 112
+ qkv_head_dim: 128
+ n_memory_tokens_rope: 1
+ n_memory_tokens_description: 0
+ original_seq_len: 501
+ rope_theta: 10000.0
+ rope_factor: 40.0
+ beta_fast: 32
+ beta_slow: 1
+ io_channels: 128
+ cond_token_dim: 1024
+ adaln_last_layer: true
+ adaln_last_layer_nomod: false
+ non_checkpoint_layers: 14
+ mask_out_before: -1
+ estimate_logvar: true
+ no_description_mask: true
+ symmetric_attention_init: false
+ patch_size: 1
+ num_sinks: 0
+ mlp_act: gelu
+ conditioners:
+ text:
+ exclude_from_checkpoint: true
+ trainable: false
+ path: checkpoints/TextConditionerA
+ autoencoder:
+ exclude_from_checkpoint: true
+ trainable: false
+ path: checkpoints/Woosh-AE
+ sigma_data: 1.0
+ pred_type: v_pred
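The attention dimensions listed after `dit:` are mutually consistent, which can be checked with a few lines of arithmetic. The reading of the key names used here is an assumption based on the names alone: `qk_nope_head_dim` as the per-head query/key slice without rotary position embedding and `qk_rope_head_dim` as the slice with it. The source does not document them.

```python
# Values copied from the keys following `dit:` in Woosh-DFlow/config.yaml.
dit = {
    "dim": 1024,
    "inter_dim": 4096,
    "n_heads": 8,
    "n_layers": 12,
    "n_multimodal_layers": 6,
    "qk_nope_head_dim": 16,
    "qk_rope_head_dim": 112,
    "qkv_head_dim": 128,
}

# Per-head query/key width, assuming the two slices are concatenated.
qk_head_dim = dit["qk_nope_head_dim"] + dit["qk_rope_head_dim"]

checks = {
    "qk slices sum to head dim": qk_head_dim == dit["qkv_head_dim"],
    "heads x head dim == model dim": dit["n_heads"] * dit["qkv_head_dim"] == dit["dim"],
    "MLP width is 4x model dim": dit["inter_dim"] == 4 * dit["dim"],
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

The same values appear in the Woosh-Flow, Woosh-VFlow-8s, and Woosh-DVFlow-8s configs in this commit, so the checks carry over unchanged.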
Woosh-DFlow/weights.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c9fc4a2220fa4ba54e01611afe58e061854a58ad559eb56de2cca39112f7c8ef
+ size 1378890620
Woosh-DVFlow-8s/config.yaml ADDED
@@ -0,0 +1,64 @@
+ exclude_from_checkpoint: false
+ trainable: true
+ pretrained_model_type: videokontext
+ ldm:
+ exclude_from_checkpoint: false
+ trainable: true
+ model_type: VideoKontextLDM
+ ldm:
+ exclude_from_checkpoint: false
+ trainable: true
+ dit:
+ model_type: mmmssflux
+ max_description_length: 77
+ max_seq_len: 501
+ rope_len_multiplier: 2
+ dim: 1024
+ inter_dim: 4096
+ fixed_timestep_features: false
+ timestep_features_dim: 256
+ n_layers: 12
+ n_heads: 8
+ n_multimodal_layers: 6
+ qk_nope_head_dim: 16
+ qk_rope_head_dim: 112
+ qkv_head_dim: 128
+ n_memory_tokens_rope: 1
+ n_memory_tokens_description: 0
+ original_seq_len: 501
+ rope_theta: 10000.0
+ rope_factor: 40.0
+ beta_fast: 32
+ beta_slow: 1
+ io_channels: 128
+ cond_token_dim: 1024
+ adaln_last_layer: true
+ adaln_last_layer_nomod: false
+ non_checkpoint_layers: 14
+ mask_out_before: -1
+ estimate_logvar: true
+ no_description_mask: true
+ symmetric_attention_init: false
+ patch_size: 1
+ num_sinks: 0
+ mlp_act: gelu
+ conditioners:
+ text:
+ exclude_from_checkpoint: true
+ trainable: false
+ path: checkpoints/TextConditionerV
+ autoencoder:
+ exclude_from_checkpoint: true
+ trainable: false
+ path: checkpoints/Woosh-AE
+ sigma_data: 1.0
+ pred_type: v_pred
+ audio_fps: 100
+ video_fps: 24
+ embed_dim: 768
+ embed_key: synch_out
+ pts_seconds_key: synch_pts_seconds
+ non_checkpoint_layers: 14
+ n_layers_encoder: 0
+ trainable_no_cond: false
+ use_batch_mask: true
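The config pairs an audio latent rate of 100 frames per second with video features at 24 frames per second, so the two sequences have different lengths for the same clip. The sketch below works out those lengths and one simple way to line them up. It is illustrative only: the 8-second clip length is read off the `-8s` folder suffix, and the floor-based alignment rule is a hypothetical choice, not the model's documented behaviour (the `pts_seconds_key` entry suggests the real pipeline uses per-frame timestamps).

```python
AUDIO_FPS = 100  # audio_fps in the config
VIDEO_FPS = 24   # video_fps in the config


def frame_counts(seconds: float) -> tuple[int, int]:
    """Number of audio latent frames and video frames for a clip length."""
    return round(seconds * AUDIO_FPS), round(seconds * VIDEO_FPS)


def video_frame_at(audio_idx: int) -> int:
    """Video frame on screen at the start time of an audio frame.

    Hypothetical alignment rule, computed in integer arithmetic to
    avoid floating-point rounding.
    """
    return audio_idx * VIDEO_FPS // AUDIO_FPS


audio_frames, video_frames = frame_counts(8.0)
print(audio_frames, video_frames, video_frame_at(audio_frames - 1))
```

Each video frame therefore spans a little over four audio frames (100 / 24).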
Woosh-DVFlow-8s/weights.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c87b796798fd96a4a3b1deeb70ccd05c6dc782fe546b622ce1368a05e441cd30
+ size 1684261500
Woosh-Flow/config.yaml ADDED
@@ -0,0 +1,48 @@
+ exclude_from_checkpoint: false
+ trainable: true
+ model_type: LatentDiffusionModel
+ dit:
+ model_type: mmmssflux
+ rope_len_multiplier: 2
+ max_description_length: 77
+ max_seq_len: 501
+ dim: 1024
+ inter_dim: 4096
+ fixed_timestep_features: false
+ timestep_features_dim: 256
+ n_layers: 12
+ n_heads: 8
+ n_multimodal_layers: 6
+ qk_nope_head_dim: 16
+ qk_rope_head_dim: 112
+ qkv_head_dim: 128
+ n_memory_tokens_rope: 1
+ n_memory_tokens_description: 0
+ original_seq_len: 501
+ rope_theta: 10000.0
+ rope_factor: 40.0
+ beta_fast: 32
+ beta_slow: 1
+ io_channels: 128
+ cond_token_dim: 1024
+ adaln_last_layer: true
+ adaln_last_layer_nomod: false
+ non_checkpoint_layers: 14
+ mask_out_before: -1
+ estimate_logvar: true
+ no_description_mask: true
+ symmetric_attention_init: false
+ patch_size: 1
+ num_sinks: 0
+ mlp_act: gelu
+ conditioners:
+ text:
+ exclude_from_checkpoint: true
+ trainable: false
+ path: checkpoints/TextConditionerA
+ autoencoder:
+ exclude_from_checkpoint: true
+ trainable: false
+ path: checkpoints/Woosh-AE
+ sigma_data: 1.0
+ pred_type: v_pred
Woosh-Flow/weights.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4e2eafa2f9063934859888a2edd25ad4485f82dca4caaa7720f98d832f7ca39b
+ size 1348979404
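The pointer records a file size of 1,348,979,404 bytes, but the commit does not state the tensor dtype, so a parameter count cannot be derived from the size alone. Once the real file has been fetched, the safetensors header answers that without loading any weights: the file begins with an 8-byte little-endian length followed by a JSON table of tensor names, dtypes, and shapes. A minimal sketch using only the standard library (the helper names are hypothetical):

```python
import json
import struct
from math import prod


def read_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        return json.loads(f.read(header_len))


def count_parameters(header: dict) -> int:
    """Sum the element counts of every tensor listed in the header."""
    return sum(
        prod(entry["shape"])
        for name, entry in header.items()
        if name != "__metadata__"  # optional free-form metadata, not a tensor
    )
```

Usage would look like `count_parameters(read_header("Woosh-Flow/weights.safetensors"))`, run against the downloaded weights rather than the LFS pointer.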
Woosh-VFlow-8s/config.yaml ADDED
@@ -0,0 +1,60 @@
+ exclude_from_checkpoint: false
+ trainable: true
+ model_type: VideoKontextLDM
+ ldm:
+ exclude_from_checkpoint: false
+ trainable: true
+ dit:
+ model_type: mmmssflux
+ max_description_length: 77
+ max_seq_len: 501
+ rope_len_multiplier: 2
+ dim: 1024
+ inter_dim: 4096
+ fixed_timestep_features: false
+ timestep_features_dim: 256
+ n_layers: 12
+ n_heads: 8
+ n_multimodal_layers: 6
+ qk_nope_head_dim: 16
+ qk_rope_head_dim: 112
+ qkv_head_dim: 128
+ n_memory_tokens_rope: 1
+ n_memory_tokens_description: 0
+ original_seq_len: 501
+ rope_theta: 10000.0
+ rope_factor: 40.0
+ beta_fast: 32
+ beta_slow: 1
+ io_channels: 128
+ cond_token_dim: 1024
+ adaln_last_layer: true
+ adaln_last_layer_nomod: false
+ non_checkpoint_layers: 14
+ mask_out_before: -1
+ estimate_logvar: true
+ no_description_mask: true
+ symmetric_attention_init: false
+ patch_size: 1
+ num_sinks: 0
+ mlp_act: gelu
+ conditioners:
+ text:
+ exclude_from_checkpoint: true
+ trainable: false
+ path: checkpoints/TextConditionerV
+ autoencoder:
+ exclude_from_checkpoint: true
+ trainable: false
+ path: checkpoints/Woosh-AE
+ sigma_data: 1.0
+ pred_type: v_pred
+ audio_fps: 100
+ video_fps: 24
+ embed_dim: 768
+ embed_key: synch_out
+ pts_seconds_key: synch_pts_seconds
+ non_checkpoint_layers: 14
+ n_layers_encoder: 0
+ trainable_no_cond: false
+ use_batch_mask: true
Woosh-VFlow-8s/weights.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ed2e7639e830060e532c955d4b4dc23869c8574605267846fa326c28730fcc7d
+ size 1654350276