Image-Text-to-Video
refalign
meka1018 gudaochangsheng committed on
Commit c124ce3 · 0 Parent(s):

Duplicate from gudaochangsheng/RefAlign-1.3B

Co-authored-by: Lei Wang <gudaochangsheng@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,36 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ asserts/abstract-refalign.png filter=lfs diff=lfs merge=lfs -text
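
The `.gitattributes` rules above route matching files through Git LFS. As a rough illustration of which files in this commit they cover, here is a minimal Python sketch. It is an approximation only: it uses `fnmatch`, which does not implement every gitattributes rule (for example, `**/` matching zero directories), and the helper name `is_lfs_tracked` and the shortened pattern list are assumptions made for this example.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A subset of the patterns from the .gitattributes diff (shortened for the example).
LFS_PATTERNS = [
    "*.safetensors",
    "*.bin",
    "*.ckpt",
    "*.pt",
    "*.pth",
    "*.tar.*",
    "*tfevents*",
    "saved_model/**/*",
    "asserts/abstract-refalign.png",
]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check of whether `path` matches one of the LFS patterns.

    Hypothetical helper: fnmatch is close to, but not the same as, git's
    own pattern matching.
    """
    posix = PurePosixPath(path)
    for pattern in patterns:
        if "/" in pattern:
            # Patterns containing a slash are compared with the full path.
            if fnmatchcase(posix.as_posix(), pattern):
                return True
        elif fnmatchcase(posix.name, pattern):
            # Slash-free patterns are compared with the file name at any depth.
            return True
    return False


if __name__ == "__main__":
    for name in [
        "diffusion_pytorch_model-refalign.safetensors",
        "asserts/abstract-refalign.png",
        "README.md",
        "config.json",
    ]:
        print(name, is_lfs_tracked(name))
```

Under this approximation the checkpoint and the PNG are LFS-tracked, while `README.md`, `config.json`, and `asserts/null.json` are stored as ordinary Git blobs.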
README.md ADDED
@@ -0,0 +1,115 @@
+ ---
+ license: mit
+ datasets:
+ - BestWishYsh/OpenS2V-5M
+ - ZhuoweiChen/Phantom-data-Koala36M
+ base_model:
+ - Wan-AI/Wan2.1-T2V-1.3B
+ pipeline_tag: image-text-to-video
+ ---
+
+
+ # 🚀 RefAlign: Representation Alignment for Reference-to-Video Generation
+
+ [![arXiv](https://img.shields.io/badge/arXiv-RefAlign-<COLOR>.svg)](https://arxiv.org/abs/2603.25743) [![arXiv](https://img.shields.io/badge/paper-RefAlign-b31b1b.svg)](https://arxiv.org/pdf/2603.25743) ![Visitors](https://visitor-badge.laobi.icu/badge?page_id=gudaochangsheng/RefAlign) [![HF-1.3B](https://img.shields.io/badge/HF-RefAlign--1.3B-yellow?logo=huggingface)](https://huggingface.co/gudaochangsheng/RefAlign-1.3B)
+ [![HF-14B](https://img.shields.io/badge/HF-RefAlign--14B-yellow?logo=huggingface)](https://huggingface.co/gudaochangsheng/RefAlign-14B)
+ [![MS-1.3B](https://img.shields.io/badge/ModelScope-RefAlign--1.3B-blue)](https://www.modelscope.cn/models/gudaochangsheng98/RefAlign-1.3B)
+ [![MS-14B](https://img.shields.io/badge/ModelScope-RefAlign--14B-blue)](https://www.modelscope.cn/models/gudaochangsheng98/RefAlign-14B)
+ [![Code](https://img.shields.io/badge/Code-RefAlign-black?style=flat&logo=github)](https://github.com/gudaochangsheng/RefAlign)
+ [![Project Page](https://img.shields.io/badge/Project-Page-2ea44f?style=flat-square)](https://gudaochangsheng.github.io/RefAlign-Page/)
+
+ <div align="center">
+ <a href="https://gudaochangsheng.github.io/">Lei Wang</a><sup>1,2,*,&ddagger;</sup>,
+ <a href="https://scholar.google.com/citations?hl=zh-TW&user=1uL_9HAAAAAJ">Yuxin Song</a><sup>2,&ddagger;</sup>,
+ <a href="https://github.com/Martinser">Ge Wu</a><sup>1</sup>,
+ <a href="https://scholar.google.com.hk/citations?user=pnuQ5UsAAAAJ&hl=zh-CN&oi=ao">Haocheng Feng</a><sup>2</sup>,
+ <a href="https://hangz-nju-cuhk.github.io/">Hang Zhou</a><sup>2</sup>,
+ <a href="https://jingdongwang2017.github.io/">Jingdong Wang</a><sup>2</sup>,
+ <a href="https://yaxingwang.github.io/">Yaxing Wang</a><sup>4,&dagger;</sup>,
+ <a href="https://scholar.google.com.hk/citations?user=6CIDtZQAAAAJ&hl=en">Jian Yang</a><sup>1,3,&dagger;</sup>
+ </div>
+
+ <div align="center">
+ <sup>1</sup> PCA Lab, VCIP, College of Computer Science, Nankai University &nbsp;&nbsp;
+ <sup>2</sup> Baidu Inc. &nbsp;&nbsp;
+ <sup>3</sup> PCA Lab, School of Intelligence Science and Technology, Nanjing University &nbsp;&nbsp;
+ <sup>4</sup> College of Artificial Intelligence, Jilin University
+ </div>
+
+ <div align="center">
+ &dagger;Corresponding authors &nbsp;&nbsp; *Intern at Baidu Inc. &nbsp;&nbsp; &ddagger;Equal contribution
+ </div>
+
+ <div align="center">
+ <img src="asserts/abstract-refalign.png" alt="demo" style="width: 100%;" />
+ <br>
+ </div>
+
+ ---
+
+ ## 🏆 OpenS2V-Eval Leaderboard
+
+ > On [OpenS2V-Eval](https://huggingface.co/spaces/BestWishYsh/OpenS2V-Eval), RefAlign-14B achieves the **highest TotalScore** among the models listed below, together with the best MotionSmoothness, FaceSim, and NexusScore.
+
+ | Model | Venue | TotalScore ↑ | Aesthetic ↑ | MotionSmoothness ↑ | MotionAmplitude ↑ | FaceSim ↑ | GmeScore ↑ | NexusScore ↑ | NaturalScore ↑ |
+ |---|---|---:|---:|---:|---:|---:|---:|---:|---:|
+ | 🥇 **RefAlign-14B (Ours)** | Open-Source | **60.42%** | 46.84% | **97.61%** | 22.48% | **55.23%** | 68.32% | **48.52%** | 73.63% |
+ | 🥇 **RefAlign-1.3B (Ours)** | Open-Source | **56.30%** | 42.96% | 94.74% | 20.74% | 53.06% | 66.85% | 43.97% | 66.25% |
+ | Saber | Closed-Source | 57.91% | 42.42% | 96.12% | 21.12% | 49.89% | 67.50% | 47.22% | 72.55% |
+ | VINO | Open-Source | 57.85% | 45.92% | 94.73% | 12.30% | 52.00% | 69.69% | 42.67% | 71.99% |
+ | BindWeave | Closed-Source | 57.61% | 45.55% | 95.90% | 13.91% | 53.71% | 67.79% | 46.84% | 66.85% |
+ | VACE-14B | Open-Source | 57.55% | **47.21%** | 94.97% | 15.02% | 55.09% | 67.27% | 44.08% | 67.04% |
+ | Phantom-14B | Open-Source | 56.77% | 46.39% | 96.31% | **33.42%** | 51.46% | **70.65%** | 37.43% | 69.35% |
+ | Kling1.6 | Closed-Source | 56.23% | 44.59% | 86.93% | **41.60%** | 40.10% | 66.20% | 45.89% | **74.59%** |
+ | Phantom-1.3B | Open-Source | 54.89% | 46.67% | 93.30% | 14.29% | 48.56% | 69.43% | 42.48% | 62.50% |
+ | MAGREF-480P | Open-Source | 52.51% | 45.02% | 93.17% | 21.81% | 30.83% | 70.47% | 43.04% | 66.90% |
+ | SkyReels-A2-P14B | Open-Source | 52.25% | 39.41% | 87.93% | 25.60% | 45.95% | 64.54% | 43.75% | 60.32% |
+ | Vidu2.0 | Closed-Source | 51.95% | 41.48% | 90.45% | 13.52% | 35.11% | 67.57% | 43.37% | 65.88% |
+
+ ## 📦 Model Weights
+
+ | Model | Params | Hugging Face | ModelScope |
+ |---|---:|---|---|
+ | RefAlign-1.3B | 1.3B | [![HF Download](https://img.shields.io/badge/HuggingFace-Download-yellow?logo=huggingface)](https://huggingface.co/gudaochangsheng/RefAlign-1.3B) | [![MS Download](https://img.shields.io/badge/ModelScope-Download-blue)](https://www.modelscope.cn/models/gudaochangsheng98/RefAlign-1.3B) |
+ | RefAlign-14B | 14B | [![HF Download](https://img.shields.io/badge/HuggingFace-Download-yellow?logo=huggingface)](https://huggingface.co/gudaochangsheng/RefAlign-14B) | [![MS Download](https://img.shields.io/badge/ModelScope-Download-blue)](https://www.modelscope.cn/models/gudaochangsheng98/RefAlign-14B) |
+
+ > ⚠️ **Note**
+ >
+ > The provided weights are **DiT (Diffusion Transformer) checkpoints fine-tuned from Wan2.1**.
+ > To run RefAlign, please:
+ >
+ > 1. Download the original **[Wan2.1](https://huggingface.co/collections/Wan-AI/wan21)** model (including VAE, text encoder, etc.).
+ > 2. Replace the **DiT weights** in Wan2.1 with the RefAlign weights provided above.
+ >
+ > No modification is required for other components.
+
+ ## 🎬 Inference
+
+
+ ```shell
+ # Run inference with RefAlign-1.3B
+ python examples/wanvideo/model_inference/Wan2.1-T2V-1.3B_subject.py
+
+ # Run inference with RefAlign-14B
+ python examples/wanvideo/model_inference/Wan2.1-T2V-14B_subject.py
+ ```
+ ## Citation
+
+ If you find RefAlign useful, please consider giving our repository a star (⭐) and citing our [paper](https://arxiv.org/abs/2603.25743).
+
+ ```
+ @misc{wang2026refalign,
+   title={RefAlign: Representation Alignment for Reference-to-Video Generation},
+   author={Lei Wang and Yuxin Song and Ge Wu and Haocheng Feng and Hang Zhou and Jingdong Wang and Yaxing Wang and Jian Yang},
+   year={2026},
+   eprint={2603.25743},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV}
+ }
+ ```
+ ## Acknowledgement
+
+ This project is built on [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio). We thank its authors for their excellent work.
+ We also gratefully acknowledge the inspiring prior work of [Phantom](https://github.com/Phantom-video/Phantom), [VINO](https://sotamak1r.github.io/VINO-web/), [OpenS2V](https://github.com/PKU-YuanGroup/OpenS2V-Nexus), [Phantom-Data](https://phantom-video.github.io/Phantom-Data/), and [Wan2.1](https://wan.video/).
+ ## Contact
+ If you have any questions, please feel free to reach out at `scitop1998@gmail.com`.
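
The README's note on model weights asks users to replace the DiT weights in a Wan2.1 download with the RefAlign checkpoint. One way to read that step is as a plain file swap, sketched below in Python. This is an illustration under stated assumptions, not the authors' procedure: the helper name `overlay_refalign_weights` is hypothetical, and the default DiT file name `diffusion_pytorch_model.safetensors` is an assumption about the Wan2.1 directory layout that should be checked against the actual download. The inference scripts may instead accept a checkpoint path directly.

```python
import shutil
from pathlib import Path


def overlay_refalign_weights(
    wan_dir,
    refalign_ckpt,
    dit_filename="diffusion_pytorch_model.safetensors",  # assumed Wan2.1 DiT file name
):
    """Copy the RefAlign DiT checkpoint over the DiT file in a Wan2.1 directory.

    Hypothetical helper. The original DiT file is kept once as
    `<dit_filename>.orig`; the VAE, text encoder, and other files are untouched.
    """
    wan_dir = Path(wan_dir)
    refalign_ckpt = Path(refalign_ckpt)
    if not wan_dir.is_dir():
        raise FileNotFoundError(f"Wan2.1 directory not found: {wan_dir}")
    if not refalign_ckpt.is_file():
        raise FileNotFoundError(f"RefAlign checkpoint not found: {refalign_ckpt}")

    target = wan_dir / dit_filename
    backup = target.with_name(target.name + ".orig")
    if target.exists() and not backup.exists():
        # Preserve the original Wan2.1 DiT weights before overwriting them.
        shutil.move(str(target), str(backup))
    shutil.copyfile(refalign_ckpt, target)
    return target
```

A call might look like `overlay_refalign_weights("models/Wan2.1-T2V-1.3B", "diffusion_pytorch_model-refalign.safetensors")`, where both paths are placeholders for local downloads.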
asserts/abstract-refalign.png ADDED

Git LFS Details

  • SHA256: d9d3a75a1b85a70861c41b2c3584577f12920a6ddaaf7ec426a0bf07053b4b0b
  • Pointer size: 132 Bytes
  • Size of remote file: 1.61 MB
asserts/null.json ADDED
File without changes
config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "model_type": "refalign",
+   "architectures": ["RefAlignForConditionalGeneration"],
+   "hidden_size": 0,
+   "num_hidden_layers": 0,
+   "note": "This config file is added to enable Hugging Face Hub download statistics tracking."
+ }
diffusion_pytorch_model-refalign.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fc3698513712e22693c0d7fb2669298ba19af55167658308c9ad413a556df474
+ size 2838077304
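
The three lines above are a Git LFS pointer: the repository stores only the object's hash and size, and the roughly 2.8 GB checkpoint lives in LFS storage. A downloaded copy can be checked against the pointer, as in the following minimal Python sketch. The function names `parse_lfs_pointer` and `verify_download` are hypothetical, chosen for this example only.

```python
import hashlib
from pathlib import Path

# Pointer contents as shown in the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:fc3698513712e22693c0d7fb2669298ba19af55167658308c9ad413a556df474
size 2838077304
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer into its version, hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def verify_download(path, pointer: dict) -> bool:
    """Return True if the file at `path` has the size and digest named in the pointer."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False  # cheap check first; avoids hashing a truncated download
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Hash in 1 MiB chunks so a multi-gigabyte file is never fully in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]
```

For example, `verify_download("diffusion_pytorch_model-refalign.safetensors", parse_lfs_pointer(POINTER_TEXT))` should return `True` only for a complete, uncorrupted download of the checkpoint.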