Commit · 7c386ab
0 Parent(s)
Duplicate from drbaph/Woosh
Co-authored-by: DRBAPH <drbaph@users.noreply.huggingface.co>
- .gitattributes +36 -0
- ComfyUI-Woosh-example.mp4 +3 -0
- README.md +101 -0
- TextConditionerA/config.yaml +22 -0
- TextConditionerA/weights.safetensors +3 -0
- TextConditionerV/config.yaml +22 -0
- TextConditionerV/weights.safetensors +3 -0
- Woosh-AE/config.yaml +27 -0
- Woosh-AE/weights.safetensors +3 -0
- Woosh-DFlow/config.yaml +51 -0
- Woosh-DFlow/weights.safetensors +3 -0
- Woosh-DVFlow-8s/config.yaml +64 -0
- Woosh-DVFlow-8s/weights.safetensors +3 -0
- Woosh-Flow/config.yaml +48 -0
- Woosh-Flow/weights.safetensors +3 -0
- Woosh-VFlow-8s/config.yaml +60 -0
- Woosh-VFlow-8s/weights.safetensors +3 -0
.gitattributes
ADDED
@@ -0,0 +1,36 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+ComfyUI-Woosh-example.mp4 filter=lfs diff=lfs merge=lfs -text
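The `.gitattributes` rules route matching files through Git LFS. As a rough illustration, the sketch below checks which of this commit's files those rules would catch. It assumes simple basename matching with Python's `fnmatch`, which only approximates gitattributes semantics (directory rules such as `saved_model/**/*` are not handled). `LFS_PATTERNS` and `is_lfs_tracked` are hypothetical names, and only a subset of the patterns is listed.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Subset of the patterns from the .gitattributes diff (assumption: basename-only rules).
LFS_PATTERNS = [
    "*.safetensors",
    "*.bin",
    "*.ckpt",
    "*.pt",
    "*.zip",
    "ComfyUI-Woosh-example.mp4",
]


def is_lfs_tracked(path: str) -> bool:
    """Approximate check: match the file's basename against each pattern."""
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in LFS_PATTERNS)


for path in [
    "Woosh-AE/weights.safetensors",
    "Woosh-AE/config.yaml",
    "ComfyUI-Woosh-example.mp4",
    "README.md",
]:
    print(path, is_lfs_tracked(path))
```

Under these assumptions the weight files and the example video are stored as LFS pointers, while the YAML configs and the README are stored as regular text, which matches the diffs in this commit.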
ComfyUI-Woosh-example.mp4
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:fb9d2624d4b8f17def1ce69a821ac966dadeedefdcb4ecbf8ec1fcc5c56b1964
+size 4379001
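Each LFS-tracked file appears in the diff as a three-line pointer rather than as binary content. A minimal sketch of reading such a pointer follows. `parse_lfs_pointer` is a hypothetical helper written for this illustration and is not part of Git LFS or any Hugging Face library.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


# Pointer text copied from the ComfyUI-Woosh-example.mp4 diff.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:fb9d2624d4b8f17def1ce69a821ac966dadeedefdcb4ecbf8ec1fcc5c56b1964
size 4379001
"""

info = parse_lfs_pointer(POINTER)
print(info["algo"], info["size"], f"{info['size'] / 2**20:.2f} MiB")
```

The `size` field gives the byte size of the actual file, so the example video is a little over 4 MiB even though the diff shows only three lines.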
README.md
ADDED
@@ -0,0 +1,101 @@
+---
+license: cc-by-nc-4.0
+language:
+- en
+pipeline_tag: text-to-audio
+tags:
+- t2a
+- v2a
+- text-to-audio
+- video-to-audio
+- woosh
+- comfyui
+- diffusion
+- audio
+- flow-matching
+---
+
+# Woosh — Sound Effect Generative Models
+
+Inference code and open weights for sound effect generative models developed at Sony AI.
+
+[](https://github.com/SonyResearch/Woosh)
+[](https://github.com/Saganaki22/ComfyUI-Woosh)
+[](https://arxiv.org/abs/2502.07359)
+
+
+
+
+
+
+<video controls width="100%">
+<source src="https://huggingface.co/drbaph/Woosh/resolve/main/ComfyUI-Woosh-example.mp4" type="video/mp4">
+Your browser does not support the video tag.
+</video>
+
+## Models
+
+| Model | Task | Steps | CFG | Description |
+|-------|------|-------|-----|-------------|
+| **Woosh-Flow** | Text-to-Audio | 50 | 4.5 | Base model, best quality |
+| **Woosh-DFlow** | Text-to-Audio | 4 | 1.0 | Distilled Flow, fast generation |
+| **Woosh-VFlow** | Video-to-Audio | 50 | 4.5 | Base video-to-audio model |
+| **Woosh-DVFlow** | Video-to-Audio | 4 | 1.0 | Distilled VFlow, fast video-to-audio |
+
+### Components
+
+- **Woosh-AE** — High-quality latent encoder/decoder. Provides latents for generative modeling and decodes audio from
+generated latents.
+- **Woosh-CLAP (TextConditionerA/V)** — Multimodal text-audio alignment model. Provides token latents for diffusion
+model conditioning. TextConditionerA for T2A, TextConditionerV for V2A.
+- **Woosh-Flow / Woosh-DFlow** — Original and distilled LDMs for text-to-audio generation.
+- **Woosh-VFlow** — Multimodal LDM generating audio from video with optional text prompts.
+
+## ComfyUI Nodes
+
+Use these models in [ComfyUI](https://github.com/comfyanonymous/ComfyUI) with
+[ComfyUI-Woosh](https://github.com/Saganaki22/ComfyUI-Woosh):
+
+```bash
+# Via ComfyUI Manager — search "Woosh" and click Install
+# Or manually:
+cd ComfyUI/custom_nodes
+git clone https://github.com/Saganaki22/ComfyUI-Woosh.git
+pip install -r ComfyUI-Woosh/requirements.txt
+```
+
+Place downloaded model folders in `ComfyUI/models/woosh/`. See the [ComfyUI-Woosh
+README](https://github.com/Saganaki22/ComfyUI-Woosh) for full setup and workflow examples.
+
+> **Note:** Set the Woosh TextConditioning node to **T2A** for Flow/DFlow models and **V2A** for VFlow/DVFlow models.
+
+## Inference
+
+See the [official Woosh repository](https://github.com/SonyResearch/Woosh) for standalone inference code and training
+details.
+
+## VRAM Requirements
+
+| Model | VRAM (Approx) |
+|-------|---------------|
+| Flow / VFlow | ~8-12 GB |
+| DFlow / DVFlow | ~4-6 GB |
+| With CPU offload | ~2-4 GB |
+
+## Citation
+
+```bibtex
+@article{saghibakshi2025woosh,
+title={Woosh: Enhancing Text-to-Audio Generation with Flow Matching and FlowMap Distillation},
+author={Saghibakshi, Ali and Bakshi, Soroosh and Tagliasacchi, Antonio and Wang, Shaojie and Choi, Jongmin and
+Kawakami, Kazuhiro and Gu, Yuxuan},
+journal={arXiv preprint arXiv:2502.07359},
+year={2025}
+}
+```
+
+## License
+
+- **Code** — Apache 2.0
+- **Model Weights** — [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
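The README's Models table and its T2A/V2A note can be condensed into a small lookup. The sketch below is only an illustration of those documented values. `Preset`, `PRESETS`, and `conditioner_folder` are hypothetical names and do not come from the Woosh or ComfyUI-Woosh code.

```python
from typing import NamedTuple


class Preset(NamedTuple):
    task: str    # "T2A" or "V2A"; also the TextConditioning mode named in the README note
    steps: int   # sampling steps from the Models table
    cfg: float   # CFG value from the Models table


# Values copied from the README's Models table.
PRESETS = {
    "Woosh-Flow": Preset("T2A", 50, 4.5),
    "Woosh-DFlow": Preset("T2A", 4, 1.0),
    "Woosh-VFlow": Preset("V2A", 50, 4.5),
    "Woosh-DVFlow": Preset("V2A", 4, 1.0),
}


def conditioner_folder(model: str) -> str:
    """Return the text conditioner folder that the README pairs with a model."""
    return "TextConditionerA" if PRESETS[model].task == "T2A" else "TextConditionerV"


for name, preset in PRESETS.items():
    print(name, preset.task, preset.steps, preset.cfg, conditioner_folder(name))
```

The pairing agrees with the `conditioners.text.path` entries in the model configs of this commit, where the Flow and DFlow configs point at `checkpoints/TextConditionerA` and the VFlow and DVFlow configs at `checkpoints/TextConditionerV`.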
TextConditionerA/config.yaml
ADDED
@@ -0,0 +1,22 @@
+exclude_from_checkpoint: false
+trainable: false
+sentence_config:
+frozen: false
+model: roberta-large
+max_sentence_tokens: 77
+adopt_n_layers: 0
+adopt_layer_size: 2048
+pool_type: eos
+add_pooling_layer: true
+hidden_dropout_prob: 0.1
+attention_probs_dropout_prob: 0.1
+finetune_n_layers: -1
+last_hidden_state: true
+use_shared_space: false
+normalize_shared_space: true
+freeze_clap: true
+lhs_index: -2
+remove_special_tokens: false
+eval_mode: true
+text_preprocessing: no_op
+shared_representation_size: 1024
TextConditionerA/weights.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9f66e1877efd49e3198068ffac74d00a84df18e53189643373e2e777224856bb
+size 1425689504
TextConditionerV/config.yaml
ADDED
@@ -0,0 +1,22 @@
+exclude_from_checkpoint: false
+trainable: false
+sentence_config:
+frozen: false
+model: roberta-large
+max_sentence_tokens: 77
+adopt_n_layers: 0
+adopt_layer_size: 2048
+pool_type: eos
+add_pooling_layer: true
+hidden_dropout_prob: 0.1
+attention_probs_dropout_prob: 0.1
+finetune_n_layers: -1
+last_hidden_state: true
+use_shared_space: false
+normalize_shared_space: true
+freeze_clap: true
+lhs_index: -2
+remove_special_tokens: false
+eval_mode: true
+text_preprocessing: no_op
+shared_representation_size: 1024
TextConditionerV/weights.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b0dd912c322582cb95531de69cf1472c90752e468087a3b0fc461dea49975e80
+size 1425689504
Woosh-AE/config.yaml
ADDED
@@ -0,0 +1,27 @@
+exclude_from_checkpoint: true
+trainable: true
+istft_head:
+_target_: woosh.module.model.vocos.ISTFTCircleHead
+_partial_: true
+padding: center
+conv_pad: same
+conv_kernel: 1
+softclip: softplus
+ztransform:
+_target_: woosh.module.model.vocos.ZeroDropoutTransform
+_partial_: true
+p: 1.0
+_target_: woosh.module.model.VocosAutoEncoder
+channels: 1
+z_dim: 128
+d_model: 2048
+intermediate_dim: 3072
+n_fft: 960
+hop_length: 480
+num_layers: 8
+enc_num_layers: 8
+input_layer_norm: true
+final_layer_norm: true
+stft_normalized: false
+spec_embed: stft-complex
+sample_rate: 48000
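The STFT settings in the Woosh-AE config imply a fixed time resolution. The sketch below works through that arithmetic. It assumes the usual STFT conventions (one frame per hop, one-sided spectrum), and whether the autoencoder emits exactly one latent per STFT frame is an assumption that the diff does not confirm.

```python
# Values copied from Woosh-AE/config.yaml.
SAMPLE_RATE = 48_000
N_FFT = 960
HOP_LENGTH = 480

frames_per_second = SAMPLE_RATE / HOP_LENGTH   # STFT frames per second of audio
window_ms = 1000 * N_FFT / SAMPLE_RATE         # analysis window length in milliseconds
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE       # step between consecutive frames
freq_bins = N_FFT // 2 + 1                     # bins in a one-sided spectrum

print(frames_per_second, window_ms, hop_ms, freq_bins)
```

The resulting 100 frames per second matches the `audio_fps: 100` entry in the Woosh-VFlow-8s and Woosh-DVFlow-8s configs of this commit, which is consistent with the one-latent-per-hop assumption.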
Woosh-AE/weights.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:48e0d864beee90b17d232c74b16e9878dadc73b4ae55e4f6939df3f559c5d88a
+size 884664420
Woosh-DFlow/config.yaml
ADDED
@@ -0,0 +1,51 @@
+exclude_from_checkpoint: false
+trainable: true
+ldm:
+exclude_from_checkpoint: false
+trainable: true
+model_type: LatentDiffusionModel
+dit:
+model_type: mmmssflux
+rope_len_multiplier: 2
+max_description_length: 77
+max_seq_len: 501
+dim: 1024
+inter_dim: 4096
+fixed_timestep_features: false
+timestep_features_dim: 256
+n_layers: 12
+n_heads: 8
+n_multimodal_layers: 6
+qk_nope_head_dim: 16
+qk_rope_head_dim: 112
+qkv_head_dim: 128
+n_memory_tokens_rope: 1
+n_memory_tokens_description: 0
+original_seq_len: 501
+rope_theta: 10000.0
+rope_factor: 40.0
+beta_fast: 32
+beta_slow: 1
+io_channels: 128
+cond_token_dim: 1024
+adaln_last_layer: true
+adaln_last_layer_nomod: false
+non_checkpoint_layers: 14
+mask_out_before: -1
+estimate_logvar: true
+no_description_mask: true
+symmetric_attention_init: false
+patch_size: 1
+num_sinks: 0
+mlp_act: gelu
+conditioners:
+text:
+exclude_from_checkpoint: true
+trainable: false
+path: checkpoints/TextConditionerA
+autoencoder:
+exclude_from_checkpoint: true
+trainable: false
+path: checkpoints/Woosh-AE
+sigma_data: 1.0
+pred_type: v_pred
Woosh-DFlow/weights.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c9fc4a2220fa4ba54e01611afe58e061854a58ad559eb56de2cca39112f7c8ef
+size 1378890620
Woosh-DVFlow-8s/config.yaml
ADDED
@@ -0,0 +1,64 @@
+exclude_from_checkpoint: false
+trainable: true
+pretrained_model_type: videokontext
+ldm:
+exclude_from_checkpoint: false
+trainable: true
+model_type: VideoKontextLDM
+ldm:
+exclude_from_checkpoint: false
+trainable: true
+dit:
+model_type: mmmssflux
+max_description_length: 77
+max_seq_len: 501
+rope_len_multiplier: 2
+dim: 1024
+inter_dim: 4096
+fixed_timestep_features: false
+timestep_features_dim: 256
+n_layers: 12
+n_heads: 8
+n_multimodal_layers: 6
+qk_nope_head_dim: 16
+qk_rope_head_dim: 112
+qkv_head_dim: 128
+n_memory_tokens_rope: 1
+n_memory_tokens_description: 0
+original_seq_len: 501
+rope_theta: 10000.0
+rope_factor: 40.0
+beta_fast: 32
+beta_slow: 1
+io_channels: 128
+cond_token_dim: 1024
+adaln_last_layer: true
+adaln_last_layer_nomod: false
+non_checkpoint_layers: 14
+mask_out_before: -1
+estimate_logvar: true
+no_description_mask: true
+symmetric_attention_init: false
+patch_size: 1
+num_sinks: 0
+mlp_act: gelu
+conditioners:
+text:
+exclude_from_checkpoint: true
+trainable: false
+path: checkpoints/TextConditionerV
+autoencoder:
+exclude_from_checkpoint: true
+trainable: false
+path: checkpoints/Woosh-AE
+sigma_data: 1.0
+pred_type: v_pred
+audio_fps: 100
+video_fps: 24
+embed_dim: 768
+embed_key: synch_out
+pts_seconds_key: synch_pts_seconds
+non_checkpoint_layers: 14
+n_layers_encoder: 0
+trainable_no_cond: false
+use_batch_mask: true
Woosh-DVFlow-8s/weights.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c87b796798fd96a4a3b1deeb70ccd05c6dc782fe546b622ce1368a05e441cd30
+size 1684261500
Woosh-Flow/config.yaml
ADDED
@@ -0,0 +1,48 @@
+exclude_from_checkpoint: false
+trainable: true
+model_type: LatentDiffusionModel
+dit:
+model_type: mmmssflux
+rope_len_multiplier: 2
+max_description_length: 77
+max_seq_len: 501
+dim: 1024
+inter_dim: 4096
+fixed_timestep_features: false
+timestep_features_dim: 256
+n_layers: 12
+n_heads: 8
+n_multimodal_layers: 6
+qk_nope_head_dim: 16
+qk_rope_head_dim: 112
+qkv_head_dim: 128
+n_memory_tokens_rope: 1
+n_memory_tokens_description: 0
+original_seq_len: 501
+rope_theta: 10000.0
+rope_factor: 40.0
+beta_fast: 32
+beta_slow: 1
+io_channels: 128
+cond_token_dim: 1024
+adaln_last_layer: true
+adaln_last_layer_nomod: false
+non_checkpoint_layers: 14
+mask_out_before: -1
+estimate_logvar: true
+no_description_mask: true
+symmetric_attention_init: false
+patch_size: 1
+num_sinks: 0
+mlp_act: gelu
+conditioners:
+text:
+exclude_from_checkpoint: true
+trainable: false
+path: checkpoints/TextConditionerA
+autoencoder:
+exclude_from_checkpoint: true
+trainable: false
+path: checkpoints/Woosh-AE
+sigma_data: 1.0
+pred_type: v_pred
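Several of the `dit` values in the Woosh-Flow config are arithmetically related, and a quick check makes the relationships explicit. The sketch below only tests numbers copied from the config. Reading `qk_nope_head_dim` and `qk_rope_head_dim` as the non-rotary and rotary parts of each query/key head is an assumption based on the key names, and the duration estimate assumes 100 latent frames per second (48000 / 480 from the Woosh-AE config).

```python
# Values copied from the dit section of Woosh-Flow/config.yaml.
dit = {
    "dim": 1024,
    "n_heads": 8,
    "qk_nope_head_dim": 16,
    "qk_rope_head_dim": 112,
    "qkv_head_dim": 128,
    "max_seq_len": 501,
    "io_channels": 128,
}

AE_Z_DIM = 128    # z_dim from Woosh-AE/config.yaml
AUDIO_FPS = 100   # assumption: one latent frame per 480-sample hop at 48 kHz

# The two query/key parts add up to the full per-head width: 16 + 112 == 128.
qk_head_dim = dit["qk_nope_head_dim"] + dit["qk_rope_head_dim"]
assert qk_head_dim == dit["qkv_head_dim"]

# Heads times head width gives the model width: 8 * 128 == 1024.
assert dit["n_heads"] * dit["qkv_head_dim"] == dit["dim"]

# The transformer's input/output channels match the autoencoder's latent size.
assert dit["io_channels"] == AE_Z_DIM

# If the sequence includes both endpoint frames, 501 frames span 500 hops.
max_seconds = (dit["max_seq_len"] - 1) / AUDIO_FPS
print(qk_head_dim, max_seconds)
```

Under these assumptions `max_seq_len: 501` corresponds to roughly five seconds of audio. How `rope_len_multiplier: 2` affects the usable length is not stated in the diff.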
Woosh-Flow/weights.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4e2eafa2f9063934859888a2edd25ad4485f82dca4caaa7720f98d832f7ca39b
+size 1348979404
Woosh-VFlow-8s/config.yaml
ADDED
@@ -0,0 +1,60 @@
+exclude_from_checkpoint: false
+trainable: true
+model_type: VideoKontextLDM
+ldm:
+exclude_from_checkpoint: false
+trainable: true
+dit:
+model_type: mmmssflux
+max_description_length: 77
+max_seq_len: 501
+rope_len_multiplier: 2
+dim: 1024
+inter_dim: 4096
+fixed_timestep_features: false
+timestep_features_dim: 256
+n_layers: 12
+n_heads: 8
+n_multimodal_layers: 6
+qk_nope_head_dim: 16
+qk_rope_head_dim: 112
+qkv_head_dim: 128
+n_memory_tokens_rope: 1
+n_memory_tokens_description: 0
+original_seq_len: 501
+rope_theta: 10000.0
+rope_factor: 40.0
+beta_fast: 32
+beta_slow: 1
+io_channels: 128
+cond_token_dim: 1024
+adaln_last_layer: true
+adaln_last_layer_nomod: false
+non_checkpoint_layers: 14
+mask_out_before: -1
+estimate_logvar: true
+no_description_mask: true
+symmetric_attention_init: false
+patch_size: 1
+num_sinks: 0
+mlp_act: gelu
+conditioners:
+text:
+exclude_from_checkpoint: true
+trainable: false
+path: checkpoints/TextConditionerV
+autoencoder:
+exclude_from_checkpoint: true
+trainable: false
+path: checkpoints/Woosh-AE
+sigma_data: 1.0
+pred_type: v_pred
+audio_fps: 100
+video_fps: 24
+embed_dim: 768
+embed_key: synch_out
+pts_seconds_key: synch_pts_seconds
+non_checkpoint_layers: 14
+n_layers_encoder: 0
+trainable_no_cond: false
+use_batch_mask: true
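The video configs pair `audio_fps: 100` with `video_fps: 24`, so the two streams do not line up frame for frame. The sketch below shows only the rate arithmetic. How the model actually aligns the `synch_out` embeddings using `synch_pts_seconds` is not shown in this commit, `audio_frame_for` is a hypothetical helper, and reading the `-8s` suffix as an 8-second clip length is an assumption.

```python
from fractions import Fraction

AUDIO_FPS = 100   # audio_fps from Woosh-VFlow-8s/config.yaml
VIDEO_FPS = 24    # video_fps from Woosh-VFlow-8s/config.yaml


def audio_frame_for(video_frame: int) -> int:
    """Index of the audio latent frame that starts at or before a video frame."""
    return (video_frame * AUDIO_FPS) // VIDEO_FPS


ratio = Fraction(AUDIO_FPS, VIDEO_FPS)   # audio frames per video frame
clip_seconds = 8                         # assumption: taken from the "-8s" folder suffix

print(ratio, clip_seconds * VIDEO_FPS, clip_seconds * AUDIO_FPS)
print([audio_frame_for(i) for i in range(4)])
```

Because the ratio is 25/6, the two grids coincide exactly only every 6 video frames (25 audio frames), which is one reason a timestamp-based key such as `pts_seconds_key` is a natural fit for alignment.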
Woosh-VFlow-8s/weights.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ed2e7639e830060e532c955d4b4dc23869c8574605267846fa326c28730fcc7d
+size 1654350276
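The `size` fields of the LFS pointers give a quick estimate of how much data this commit adds. The sketch below sums them. The "minimal text-to-audio set" is a hypothetical grouping, inferred from the Woosh-DFlow config referencing `checkpoints/TextConditionerA` and `checkpoints/Woosh-AE`.

```python
# "size" values copied from the weights.safetensors LFS pointers in this commit.
WEIGHT_SIZES = {
    "TextConditionerA": 1_425_689_504,
    "TextConditionerV": 1_425_689_504,
    "Woosh-AE": 884_664_420,
    "Woosh-DFlow": 1_378_890_620,
    "Woosh-DVFlow-8s": 1_684_261_500,
    "Woosh-Flow": 1_348_979_404,
    "Woosh-VFlow-8s": 1_654_350_276,
}


def gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


total = sum(WEIGHT_SIZES.values())

# Hypothetical minimal set for distilled text-to-audio: model, conditioner, autoencoder.
t2a_fast = sum(WEIGHT_SIZES[k] for k in ("Woosh-DFlow", "TextConditionerA", "Woosh-AE"))

print(f"all weights: {gib(total):.2f} GiB, DFlow set: {gib(t2a_fast):.2f} GiB")
```

These figures describe download and disk size only. They are not a substitute for the approximate VRAM ranges given in the README.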