drbaph committed 20 days ago

Commit

7c386ab

0 Parent(s):

Duplicate from drbaph/Woosh

Co-authored-by: DRBAPH <drbaph@users.noreply.huggingface.co>

Files changed (17)

.gitattributes +36 -0
ComfyUI-Woosh-example.mp4 +3 -0
README.md +101 -0
TextConditionerA/config.yaml +22 -0
TextConditionerA/weights.safetensors +3 -0
TextConditionerV/config.yaml +22 -0
TextConditionerV/weights.safetensors +3 -0
Woosh-AE/config.yaml +27 -0
Woosh-AE/weights.safetensors +3 -0
Woosh-DFlow/config.yaml +51 -0
Woosh-DFlow/weights.safetensors +3 -0
Woosh-DVFlow-8s/config.yaml +64 -0
Woosh-DVFlow-8s/weights.safetensors +3 -0
Woosh-Flow/config.yaml +48 -0
Woosh-Flow/weights.safetensors +3 -0
Woosh-VFlow-8s/config.yaml +60 -0
Woosh-VFlow-8s/weights.safetensors +3 -0

.gitattributes ADDED

	@@ -0,0 +1,36 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+ComfyUI-Woosh-example.mp4 filter=lfs diff=lfs merge=lfs -text

ComfyUI-Woosh-example.mp4 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fb9d2624d4b8f17def1ce69a821ac966dadeedefdcb4ecbf8ec1fcc5c56b1964
+size 4379001
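
The three added lines above are a Git LFS pointer rather than the video itself: a spec version, a SHA-256 object id, and the byte size of the real file. A minimal sketch of how a downloaded file could be checked against such a pointer follows. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and only the Python standard library is used.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.splitlines():
        line = line.lstrip("+").strip()  # tolerate a leading "+" copied from a diff
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's size and digest both match the pointer."""
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as fh:
        # Read in chunks so multi-gigabyte weight files are not loaded at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer text copied from the ComfyUI-Woosh-example.mp4 hunk above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:fb9d2624d4b8f17def1ce69a821ac966dadeedefdcb4ecbf8ec1fcc5c56b1964
size 4379001
"""
pointer = parse_lfs_pointer(POINTER)
```

Calling `matches_pointer(Path("ComfyUI-Woosh-example.mp4"), pointer)` on a local copy would then confirm that the download is complete and unmodified. The same check applies to each `weights.safetensors` pointer in this commit.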

README.md ADDED

	@@ -0,0 +1,101 @@

+---
+license: cc-by-nc-4.0
+language:
+- en
+pipeline_tag: text-to-audio
+tags:
+- t2a
+- v2a
+- text-to-audio
+- video-to-audio
+- woosh
+- comfyui
+- diffusion
+- audio
+- flow-matching
+---
+ # Woosh — Sound Effect Generative Models
+  Inference code and open weights for sound effect generative models developed at Sony AI.
+  [![GitHub](https://img.shields.io/badge/GitHub-SonyResearch%2FWoosh-black)](https://github.com/SonyResearch/Woosh)
+  [![ComfyUI
+  Node](https://img.shields.io/badge/ComfyUI-ComfyUI--Woosh-blue)](https://github.com/Saganaki22/ComfyUI-Woosh)
+  [![arXiv](https://img.shields.io/badge/arXiv-2502.07359-b31b1b)](https://arxiv.org/abs/2502.07359)
+![Screenshot 2026-04-12 013347](https://cdn-uploads.huggingface.co/production/uploads/63473b59e5c0717e6737b872/kafOo1f9eZYfyyHgcbzPj.png)
+<video controls width="100%">
+  <source src="https://huggingface.co/drbaph/Woosh/resolve/main/ComfyUI-Woosh-example.mp4" type="video/mp4">
+  Your browser does not support the video tag.
+</video>
+  ## Models
+  | Model | Task | Steps | CFG | Description |
+  |-------|------|-------|-----|-------------|
+  | **Woosh-Flow** | Text-to-Audio | 50 | 4.5 | Base model, best quality |
+  | **Woosh-DFlow** | Text-to-Audio | 4 | 1.0 | Distilled Flow, fast generation |
+  | **Woosh-VFlow** | Video-to-Audio | 50 | 4.5 | Base video-to-audio model |
+  | **Woosh-DVFlow** | Video-to-Audio | 4 | 1.0 | Distilled VFlow, fast video-to-audio |
+  ### Components
+  - **Woosh-AE** — High-quality latent encoder/decoder. Provides latents for generative modeling and decodes audio from
+  generated latents.
+  - **Woosh-CLAP (TextConditionerA/V)** — Multimodal text-audio alignment model. Provides token latents for diffusion
+  model conditioning. TextConditionerA for T2A, TextConditionerV for V2A.
+  - **Woosh-Flow / Woosh-DFlow** — Original and distilled LDMs for text-to-audio generation.
+  - **Woosh-VFlow / Woosh-DVFlow** — Original and distilled multimodal LDMs generating audio from video with optional text prompts.
+  ## ComfyUI Nodes
+  Use these models in [ComfyUI](https://github.com/comfyanonymous/ComfyUI) with
+  [ComfyUI-Woosh](https://github.com/Saganaki22/ComfyUI-Woosh):
+  ```bash
+  # Via ComfyUI Manager — search "Woosh" and click Install
+  # Or manually:
+  cd ComfyUI/custom_nodes
+  git clone https://github.com/Saganaki22/ComfyUI-Woosh.git
+  pip install -r ComfyUI-Woosh/requirements.txt
+  ```
+  Place downloaded model folders in `ComfyUI/models/woosh/`. See the [ComfyUI-Woosh
+  README](https://github.com/Saganaki22/ComfyUI-Woosh) for full setup and workflow examples.
+  > **Note:** Set the Woosh TextConditioning node to **T2A** for Flow/DFlow models and **V2A** for VFlow/DVFlow models.
+  ## Inference
+  See the [official Woosh repository](https://github.com/SonyResearch/Woosh) for standalone inference code and training
+  details.
+  ## VRAM Requirements
+  | Configuration | VRAM (approx.) |
+  |-------|---------------|
+  | Flow / VFlow | ~8-12 GB |
+  | DFlow / DVFlow | ~4-6 GB |
+  | With CPU offload | ~2-4 GB |
+  ## Citation
+  ```bibtex
+  @article{saghibakshi2025woosh,
+        title={Woosh: Enhancing Text-to-Audio Generation with Flow Matching and FlowMap Distillation},
+        author={Saghibakshi, Ali and Bakshi, Soroosh and Tagliasacchi, Antonio and Wang, Shaojie and Choi, Jongmin and
+  Kawakami, Kazuhiro and Gu, Yuxuan},
+        journal={arXiv preprint arXiv:2502.07359},
+        year={2025}
+  }
+  ```
+  ## License
+  - **Code** — Apache 2.0
+  - **Model Weights** — [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
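
The Models table in the README pairs the base models with 50 steps at CFG 4.5 and the distilled models with 4 steps at CFG 1.0. The sketch below illustrates why a scale of 1.0 is a natural choice for a distilled model, assuming the common classifier-free guidance form `uncond + scale * (cond - uncond)`. Whether Woosh uses exactly this form is an assumption, and `apply_cfg` is a hypothetical helper, not part of the Woosh code.

```python
def apply_cfg(cond, uncond, scale):
    """Classifier-free guidance (common convention, assumed here):
    move from the unconditional prediction toward the conditional one."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Defaults copied from the Models table in the README.
DEFAULTS = {
    "Woosh-Flow": {"steps": 50, "cfg": 4.5},
    "Woosh-DFlow": {"steps": 4, "cfg": 1.0},
    "Woosh-VFlow": {"steps": 50, "cfg": 4.5},
    "Woosh-DVFlow": {"steps": 4, "cfg": 1.0},
}

# Toy two-element "predictions" standing in for model outputs.
cond, uncond = [1.0, 2.0], [0.5, 1.0]

guided_distilled = apply_cfg(cond, uncond, DEFAULTS["Woosh-DFlow"]["cfg"])
guided_base = apply_cfg(cond, uncond, DEFAULTS["Woosh-Flow"]["cfg"])
```

Under this convention a scale of 1.0 returns the conditional prediction unchanged, so the unconditional pass contributes nothing, while 4.5 pushes the result well past the conditional prediction.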

TextConditionerA/config.yaml ADDED

	@@ -0,0 +1,22 @@

+exclude_from_checkpoint: false
+trainable: false
+sentence_config:
+  frozen: false
+  model: roberta-large
+  max_sentence_tokens: 77
+  adopt_n_layers: 0
+  adopt_layer_size: 2048
+  pool_type: eos
+  add_pooling_layer: true
+  hidden_dropout_prob: 0.1
+  attention_probs_dropout_prob: 0.1
+  finetune_n_layers: -1
+last_hidden_state: true
+use_shared_space: false
+normalize_shared_space: true
+freeze_clap: true
+lhs_index: -2
+remove_special_tokens: false
+eval_mode: true
+text_preprocessing: no_op
+shared_representation_size: 1024

TextConditionerA/weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9f66e1877efd49e3198068ffac74d00a84df18e53189643373e2e777224856bb
+size 1425689504

TextConditionerV/config.yaml ADDED

	@@ -0,0 +1,22 @@

+exclude_from_checkpoint: false
+trainable: false
+sentence_config:
+  frozen: false
+  model: roberta-large
+  max_sentence_tokens: 77
+  adopt_n_layers: 0
+  adopt_layer_size: 2048
+  pool_type: eos
+  add_pooling_layer: true
+  hidden_dropout_prob: 0.1
+  attention_probs_dropout_prob: 0.1
+  finetune_n_layers: -1
+last_hidden_state: true
+use_shared_space: false
+normalize_shared_space: true
+freeze_clap: true
+lhs_index: -2
+remove_special_tokens: false
+eval_mode: true
+text_preprocessing: no_op
+shared_representation_size: 1024

TextConditionerV/weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b0dd912c322582cb95531de69cf1472c90752e468087a3b0fc461dea49975e80
+size 1425689504

Woosh-AE/config.yaml ADDED

	@@ -0,0 +1,27 @@

+exclude_from_checkpoint: true
+trainable: true
+istft_head:
+  _target_: woosh.module.model.vocos.ISTFTCircleHead
+  _partial_: true
+  padding: center
+  conv_pad: same
+  conv_kernel: 1
+  softclip: softplus
+ztransform:
+  _target_: woosh.module.model.vocos.ZeroDropoutTransform
+  _partial_: true
+  p: 1.0
+_target_: woosh.module.model.VocosAutoEncoder
+channels: 1
+z_dim: 128
+d_model: 2048
+intermediate_dim: 3072
+n_fft: 960
+hop_length: 480
+num_layers: 8
+enc_num_layers: 8
+input_layer_norm: true
+final_layer_norm: true
+stft_normalized: false
+spec_embed: stft-complex
+sample_rate: 48000
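
The STFT settings in this config fix the time resolution of the latent sequence. The arithmetic below is a sketch using only the values shown above (`sample_rate`, `hop_length`, `n_fft`). The frame-count formula assumes center-padded STFT framing, which the source does not confirm for the encoder, and `latent_frames` is a hypothetical helper.

```python
SAMPLE_RATE = 48_000  # sample_rate
HOP_LENGTH = 480      # hop_length
N_FFT = 960           # n_fft


def latent_frames(seconds, sample_rate=SAMPLE_RATE, hop=HOP_LENGTH):
    """Frame count for a center-padded STFT (assumed): 1 + n_samples // hop."""
    n_samples = int(round(seconds * sample_rate))
    return 1 + n_samples // hop


frame_rate = SAMPLE_RATE / HOP_LENGTH  # latent frames per second of audio
freq_bins = N_FFT // 2 + 1             # one-sided STFT frequency bins
frames_5s = latent_frames(5.0)         # frames for a 5-second clip
```

The frame rate works out to 100 per second, the same value as `audio_fps: 100` in the video configs, and a 5-second clip gives 501 frames, the same number as `max_seq_len: 501` in the flow configs. The source does not state how these settings relate, so treat the match as an observation rather than a documented constraint.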

Woosh-AE/weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:48e0d864beee90b17d232c74b16e9878dadc73b4ae55e4f6939df3f559c5d88a
+size 884664420

Woosh-DFlow/config.yaml ADDED

	@@ -0,0 +1,51 @@

+exclude_from_checkpoint: false
+trainable: true
+ldm:
+  exclude_from_checkpoint: false
+  trainable: true
+  model_type: LatentDiffusionModel
+  dit:
+    model_type: mmmssflux
+    rope_len_multiplier: 2
+    max_description_length: 77
+    max_seq_len: 501
+    dim: 1024
+    inter_dim: 4096
+    fixed_timestep_features: false
+    timestep_features_dim: 256
+    n_layers: 12
+    n_heads: 8
+    n_multimodal_layers: 6
+    qk_nope_head_dim: 16
+    qk_rope_head_dim: 112
+    qkv_head_dim: 128
+    n_memory_tokens_rope: 1
+    n_memory_tokens_description: 0
+    original_seq_len: 501
+    rope_theta: 10000.0
+    rope_factor: 40.0
+    beta_fast: 32
+    beta_slow: 1
+    io_channels: 128
+    cond_token_dim: 1024
+    adaln_last_layer: true
+    adaln_last_layer_nomod: false
+    non_checkpoint_layers: 14
+    mask_out_before: -1
+    estimate_logvar: true
+    no_description_mask: true
+    symmetric_attention_init: false
+    patch_size: 1
+    num_sinks: 0
+    mlp_act: gelu
+  conditioners:
+    text:
+      exclude_from_checkpoint: true
+      trainable: false
+      path: checkpoints/TextConditionerA
+  autoencoder:
+    exclude_from_checkpoint: true
+    trainable: false
+    path: checkpoints/Woosh-AE
+  sigma_data: 1.0
+  pred_type: v_pred
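
This config does not bundle the text conditioner or the autoencoder. It points at them through `path: checkpoints/...` entries, so those folders need to be present alongside the model. A minimal sketch of collecting those references from a config file follows. It uses a regular expression instead of a YAML parser to stay within the standard library, and `component_dirs` is a hypothetical helper. How the `checkpoints/` prefix maps onto a local directory such as `ComfyUI/models/woosh/` is an assumption, not something this diff shows.

```python
import re


def component_dirs(config_text: str) -> list[str]:
    """Collect folder names referenced by `path: checkpoints/<name>` lines."""
    found = re.findall(
        r"^\+?[ \t]*path:[ \t]*checkpoints/(\S+)[ \t]*$",
        config_text,
        flags=re.MULTILINE,
    )
    # Preserve first-seen order and drop duplicates.
    return list(dict.fromkeys(found))


# Excerpt copied from Woosh-DFlow/config.yaml above.
DFLOW_EXCERPT = """\
  conditioners:
    text:
      exclude_from_checkpoint: true
      trainable: false
      path: checkpoints/TextConditionerA
  autoencoder:
    exclude_from_checkpoint: true
    trainable: false
    path: checkpoints/Woosh-AE
"""
deps = component_dirs(DFLOW_EXCERPT)
```

For Woosh-DFlow this yields `TextConditionerA` and `Woosh-AE`. The two video configs reference `TextConditionerV` instead, which is consistent with the README note about selecting T2A or V2A conditioning.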

Woosh-DFlow/weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c9fc4a2220fa4ba54e01611afe58e061854a58ad559eb56de2cca39112f7c8ef
+size 1378890620

Woosh-DVFlow-8s/config.yaml ADDED

	@@ -0,0 +1,64 @@

+exclude_from_checkpoint: false
+trainable: true
+pretrained_model_type: videokontext
+ldm:
+  exclude_from_checkpoint: false
+  trainable: true
+  model_type: VideoKontextLDM
+  ldm:
+    exclude_from_checkpoint: false
+    trainable: true
+    dit:
+      model_type: mmmssflux
+      max_description_length: 77
+      max_seq_len: 501
+      rope_len_multiplier: 2
+      dim: 1024
+      inter_dim: 4096
+      fixed_timestep_features: false
+      timestep_features_dim: 256
+      n_layers: 12
+      n_heads: 8
+      n_multimodal_layers: 6
+      qk_nope_head_dim: 16
+      qk_rope_head_dim: 112
+      qkv_head_dim: 128
+      n_memory_tokens_rope: 1
+      n_memory_tokens_description: 0
+      original_seq_len: 501
+      rope_theta: 10000.0
+      rope_factor: 40.0
+      beta_fast: 32
+      beta_slow: 1
+      io_channels: 128
+      cond_token_dim: 1024
+      adaln_last_layer: true
+      adaln_last_layer_nomod: false
+      non_checkpoint_layers: 14
+      mask_out_before: -1
+      estimate_logvar: true
+      no_description_mask: true
+      symmetric_attention_init: false
+      patch_size: 1
+      num_sinks: 0
+      mlp_act: gelu
+    conditioners:
+      text:
+        exclude_from_checkpoint: true
+        trainable: false
+        path: checkpoints/TextConditionerV
+    autoencoder:
+      exclude_from_checkpoint: true
+      trainable: false
+      path: checkpoints/Woosh-AE
+    sigma_data: 1.0
+    pred_type: v_pred
+  audio_fps: 100
+  video_fps: 24
+  embed_dim: 768
+  embed_key: synch_out
+  pts_seconds_key: synch_pts_seconds
+  non_checkpoint_layers: 14
+  n_layers_encoder: 0
+  trainable_no_cond: false
+  use_batch_mask: true
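
The `audio_fps: 100` and `video_fps: 24` settings describe two timelines that run at different rates. The sketch below shows only the arithmetic of relating them. It assumes the `-8s` suffix in the folder name means an 8-second clip, and `audio_frame_for_video_frame` is a hypothetical helper. How Woosh actually aligns the `synch_out` video features with audio latents is not shown in this diff.

```python
AUDIO_FPS = 100  # audio_fps
VIDEO_FPS = 24   # video_fps


def audio_frame_for_video_frame(i, audio_fps=AUDIO_FPS, video_fps=VIDEO_FPS):
    """Index of the audio frame that starts at or just before video frame i."""
    # Video frame i starts at i / video_fps seconds. Integer math avoids
    # floating-point rounding at exact frame boundaries.
    return (i * audio_fps) // video_fps


clip_seconds = 8  # assumption: "-8s" denotes an 8-second clip
n_video = clip_seconds * VIDEO_FPS
n_audio = clip_seconds * AUDIO_FPS
last_audio_index = audio_frame_for_video_frame(n_video - 1)
```

Under these assumptions an 8-second clip spans 192 video frames and 800 audio frames, so each video frame covers a little over four audio frames.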

Woosh-DVFlow-8s/weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c87b796798fd96a4a3b1deeb70ccd05c6dc782fe546b622ce1368a05e441cd30
+size 1684261500

Woosh-Flow/config.yaml ADDED

	@@ -0,0 +1,48 @@

+exclude_from_checkpoint: false
+trainable: true
+model_type: LatentDiffusionModel
+dit:
+  model_type: mmmssflux
+  rope_len_multiplier: 2
+  max_description_length: 77
+  max_seq_len: 501
+  dim: 1024
+  inter_dim: 4096
+  fixed_timestep_features: false
+  timestep_features_dim: 256
+  n_layers: 12
+  n_heads: 8
+  n_multimodal_layers: 6
+  qk_nope_head_dim: 16
+  qk_rope_head_dim: 112
+  qkv_head_dim: 128
+  n_memory_tokens_rope: 1
+  n_memory_tokens_description: 0
+  original_seq_len: 501
+  rope_theta: 10000.0
+  rope_factor: 40.0
+  beta_fast: 32
+  beta_slow: 1
+  io_channels: 128
+  cond_token_dim: 1024
+  adaln_last_layer: true
+  adaln_last_layer_nomod: false
+  non_checkpoint_layers: 14
+  mask_out_before: -1
+  estimate_logvar: true
+  no_description_mask: true
+  symmetric_attention_init: false
+  patch_size: 1
+  num_sinks: 0
+  mlp_act: gelu
+conditioners:
+  text:
+    exclude_from_checkpoint: true
+    trainable: false
+    path: checkpoints/TextConditionerA
+autoencoder:
+  exclude_from_checkpoint: true
+  trainable: false
+  path: checkpoints/Woosh-AE
+sigma_data: 1.0
+pred_type: v_pred
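
Several numbers in the `dit` block are related to each other. The sketch below checks three relations that hold for the values shown: the two query/key head dimensions sum to `qkv_head_dim`, heads times head dimension equals `dim`, and `io_channels` equals the autoencoder's `z_dim`. Whether the Woosh code enforces any of these is an assumption, and `check_dit` is a hypothetical helper meant for sanity-checking an edited config.

```python
# Values copied from the dit block of Woosh-Flow/config.yaml above.
DIT = {
    "dim": 1024,
    "n_heads": 8,
    "qk_nope_head_dim": 16,
    "qk_rope_head_dim": 112,
    "qkv_head_dim": 128,
    "io_channels": 128,
}
AE_Z_DIM = 128  # z_dim from Woosh-AE/config.yaml


def check_dit(cfg, ae_z_dim=AE_Z_DIM):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    if cfg["qk_nope_head_dim"] + cfg["qk_rope_head_dim"] != cfg["qkv_head_dim"]:
        problems.append("qk_nope_head_dim + qk_rope_head_dim != qkv_head_dim")
    if cfg["n_heads"] * cfg["qkv_head_dim"] != cfg["dim"]:
        problems.append("n_heads * qkv_head_dim != dim")
    if cfg["io_channels"] != ae_z_dim:
        problems.append("io_channels != autoencoder z_dim")
    return problems


problems = check_dit(DIT)
```

An empty result means the three relations hold. Changing `n_heads` without adjusting `dim`, for example, would be reported.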

Woosh-Flow/weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4e2eafa2f9063934859888a2edd25ad4485f82dca4caaa7720f98d832f7ca39b
+size 1348979404

Woosh-VFlow-8s/config.yaml ADDED

	@@ -0,0 +1,60 @@

+exclude_from_checkpoint: false
+trainable: true
+model_type: VideoKontextLDM
+ldm:
+  exclude_from_checkpoint: false
+  trainable: true
+  dit:
+    model_type: mmmssflux
+    max_description_length: 77
+    max_seq_len: 501
+    rope_len_multiplier: 2
+    dim: 1024
+    inter_dim: 4096
+    fixed_timestep_features: false
+    timestep_features_dim: 256
+    n_layers: 12
+    n_heads: 8
+    n_multimodal_layers: 6
+    qk_nope_head_dim: 16
+    qk_rope_head_dim: 112
+    qkv_head_dim: 128
+    n_memory_tokens_rope: 1
+    n_memory_tokens_description: 0
+    original_seq_len: 501
+    rope_theta: 10000.0
+    rope_factor: 40.0
+    beta_fast: 32
+    beta_slow: 1
+    io_channels: 128
+    cond_token_dim: 1024
+    adaln_last_layer: true
+    adaln_last_layer_nomod: false
+    non_checkpoint_layers: 14
+    mask_out_before: -1
+    estimate_logvar: true
+    no_description_mask: true
+    symmetric_attention_init: false
+    patch_size: 1
+    num_sinks: 0
+    mlp_act: gelu
+  conditioners:
+    text:
+      exclude_from_checkpoint: true
+      trainable: false
+      path: checkpoints/TextConditionerV
+  autoencoder:
+    exclude_from_checkpoint: true
+    trainable: false
+    path: checkpoints/Woosh-AE
+  sigma_data: 1.0
+  pred_type: v_pred
+audio_fps: 100
+video_fps: 24
+embed_dim: 768
+embed_key: synch_out
+pts_seconds_key: synch_pts_seconds
+non_checkpoint_layers: 14
+n_layers_encoder: 0
+trainable_no_cond: false
+use_batch_mask: true

Woosh-VFlow-8s/weights.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ed2e7639e830060e532c955d4b4dc23869c8574605267846fa326c28730fcc7d
+size 1654350276
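
The `size` fields of the LFS pointers in this commit give the on-disk size of each weights file, which makes it easy to estimate a download before starting it. The sketch below sums them. The byte counts are copied from the pointers above, `download_bytes` is a hypothetical helper, and the folder grouping for a text-to-audio setup follows the `path:` entries in `Woosh-Flow/config.yaml`.

```python
# Byte sizes copied from the weights.safetensors LFS pointers in this commit.
SIZES = {
    "TextConditionerA": 1_425_689_504,
    "TextConditionerV": 1_425_689_504,
    "Woosh-AE": 884_664_420,
    "Woosh-DFlow": 1_378_890_620,
    "Woosh-DVFlow-8s": 1_684_261_500,
    "Woosh-Flow": 1_348_979_404,
    "Woosh-VFlow-8s": 1_654_350_276,
}


def download_bytes(*folders):
    """Total weight-file bytes for the given model folders."""
    return sum(SIZES[name] for name in folders)


total = download_bytes(*SIZES)
flow_only = download_bytes("Woosh-Flow", "TextConditionerA", "Woosh-AE")

print(f"all weights: {total / 1e9:.2f} GB")
print(f"Woosh-Flow with its conditioner and autoencoder: {flow_only / 1e9:.2f} GB")
```

These figures are disk and transfer sizes only. They are not a substitute for the VRAM estimates in the README, which depend on precision and offloading.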