	---
	license: apache-2.0
	language:
	- en
	tags:
	- text-to-video
	- image-to-video
	- video-generation
	- diffusion-transformer
	pipeline_tag: text-to-video
	library_name: diffusers
	widget:
	- text: "A vibrant blue jay perches gracefully on a slender branch, its feathers shimmering in the soft morning light. The bird's keen eyes scan the surroundings, capturing the essence of the tranquil forest. It flutters its wings briefly, showcasing the intricate patterns of blue, white, and black on its plumage. The background reveals a lush canopy of green leaves, with rays of sunlight filtering through, creating a dappled effect on the forest floor. The blue jay then tilts its head, emitting a melodious call that echoes through the serene woodland, adding a touch of magic to the peaceful scene."
	  output:
	    url: assets/bird.mp4
	- text: "Underwater footage of a vibrant coral reef ecosystem with tropical fish swimming through coral formations. Natural sunlight filtering down through clear water creates dancing light patterns on the reef. Smooth underwater camera movement, natural color correction preserving authentic ocean blues and coral colors, documentary marine biology style, peaceful and educational mood."
	  output:
	    url: assets/underwater.mp4
	- text: "An old fisherman mends his nets on a stone harbor wall, weathered hands moving with practiced speed through the green mesh. Shot on a 50mm lens with a slow dolly-in from his side, the afternoon sun throws warm light across his salt-stained coat and the worn granite beneath him. Behind him, a single wooden boat bobs gently in a turquoise bay. Gulls drift through the distant sky in soft focus. The camera settles on his hands, then racks focus to his weathered, squinting eyes."
	  output:
	    url: assets/fisherman.mp4
	- text: "A lone astronaut drifts just outside a derelict space station, tethered by a single silver line as Earth's terminator glows blue-white behind her. Shot with a slow wide-to-medium push, the camera floats alongside her in weightless silence, the curvature of the planet filling the lower third of the frame. Sunlight rakes across the hull's scarred panels, casting long hard shadows that stretch and shift as she rotates. Her visor reflects the aurora below, ribbons of green pulling across the glass. She reaches out with a gloved hand and lets her fingertips graze a dented antenna, the gesture small and reverent."
	  output:
	    url: assets/astronaut.mp4
	- text: "A woman standing in a sunlit field as flower petals swirl around her in slow motion. Each petal floats gently through the golden light, casting tiny shadows. Her hair moves like water, and time seems to stand still."
	  output:
	    url: assets/woman.mp4
	---

	<p align="center">
	<img src="assets/banner.png" width="100%" alt="Motif-Video 2B teaser"/>
	</p>

	<p align="center">
	<h1 align="center">Motif-Video 2B</h1>
	</p>

	<p align="center">
	<b>A micro-budget text-to-video diffusion transformer from Motif Technologies</b>
	</p>

	<p align="center">
	📑 <a href="https://arxiv.org/abs/2604.16503">Technical Report</a>  |
	🤗 <a href="">Hugging Face</a>  |
	🌐 <a href="https://motiftech.io/videoshowcase">Project Page</a>
	</p>

	---

	<!--
	NOTE: This README is written against the CURRENT state of diffusers PR #13551
	(pre-merge). The PR currently has issues:
	- negative_prompt defaults to None (should be built-in)
	- use_linear_quadratic_schedule defaults to True (should be False)
	- DPMSolverMultistepScheduler crashes (pipeline always passes sigmas)
	- No built-in SageAttention support (requires manual patching)

	Code examples below include workarounds (explicit negative_prompt,
	use_linear_quadratic_schedule=False, _FlowDPMSolver subclass).

	TODO: Update after PR feedback is applied, and again after merge.
	Tracking: https://github.com/MotifTechnologies/diffusers/pull/1
	-->

	## 🔥 News

	- [2026-04-29] RTX 4090 benchmarks added — SageAttention achieves ~3.16× speedup, all GGUF variants fit in 24 GB. See [GGUF + SageAttention](docs/gguf-sageattention.md#benchmark).
	- [2026-04-28] ComfyUI custom nodes released: [ComfyUI-MotifVideo2B](https://github.com/MotifTechnologies/ComfyUI-MotifVideo2B). GGUF workflow support coming soon.
	- [2026-04-28] GGUF quantized weights now available at [Motif-Video-2B-GGUF](https://huggingface.co/Motif-Technologies/Motif-Video-2B-GGUF) — up to 2.7 GB VRAM savings with no speed penalty. SageAttention support for ~2× faster inference. See [GGUF + SageAttention](#🧊-gguf--sageattention) below.
	- [2026-04-14] We release Motif-Video 2B, our 2B-parameter text-to-video and image-to-video diffusion transformer, together with the full [technical report](https://arxiv.org/abs/2604.16503).

	---

	## 📖 Introduction

	Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. Motif-Video 2B asks whether competitive text-to-video quality is reachable at a much smaller budget — fewer than 10M training clips and under 100,000 H200 GPU hours — and shows that the answer is yes, provided the model design explicitly separates objectives that scaling would otherwise leave entangled.

	Our central observation is that prompt alignment, temporal consistency, and fine-detail recovery interfere with one another when handled through the same pathway. Motif-Video 2B addresses this objective interference architecturally rather than relying on scale alone, through two contributions:

	- Shared Cross-Attention. A residual cross-attention mechanism that reuses self-attention K/V weights to stabilize text–video alignment under long-context token sparsity, where standard joint attention dilutes text influence as the video token sequence grows.
	- Three-stage DDT-style backbone. 12 dual-stream + 16 single-stream + 8 DDT decoder layers, separating early modality fusion, joint representation learning, and high-frequency detail reconstruction into dedicated components. Per-block attention analysis shows that the DDT decoder spontaneously develops inter-frame attention structure absent from the encoder layers.
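
The dilution effect that motivates Shared Cross-Attention can be illustrated with a toy calculation. The sketch below is not the model's attention code: it assumes every video key has logit 0 and every text key has the same logit `text_logit_bonus`, so the softmax collapses to a weighted token count. The function name and the 256-token prompt length are illustrative assumptions.

```python
import math


def text_attention_share(n_text: int, n_video: int, text_logit_bonus: float = 0.0) -> float:
    """Fraction of softmax attention mass that lands on text tokens.

    Toy model: every video key gets logit 0 and every text key gets
    logit `text_logit_bonus`, so the softmax reduces to a weighted count.
    """
    text_mass = n_text * math.exp(text_logit_bonus)
    return text_mass / (text_mass + n_video)


if __name__ == "__main__":
    # 256 text tokens is an illustrative assumption, not the model's real prompt length.
    for n_video in (1_000, 10_000, 100_000):
        share = text_attention_share(n_text=256, n_video=n_video)
        print(f"{n_video:>7} video tokens -> {share:.2%} of attention on text")
```

With a fixed prompt length, the text share shrinks roughly in proportion to the video sequence length, which is the regime where a dedicated cross-attention path is meant to help.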

	These are paired with a micro-budget training recipe combining TREAD token routing and early-phase REPA with a frozen V-JEPA teacher — to our knowledge, the first time this combination has been applied to text-to-video training.

	On VBench, Motif-Video 2B reaches 83.76%, the highest Total Score among open-source models we evaluate, surpassing Wan2.1-14B at 7× fewer parameters and roughly an order of magnitude less training data.

	<!--
	Architecture figure — replace with Figure 2 from the technical report
	(the three-stage backbone + Shared Cross-Attention diagram).
	-->
	<p align="center">
	<img src="assets/architecture.png" width="90%" alt="Motif-Video 2B architecture"/>
	</p>

	---

	## ✨ Highlights

	- Two tasks, one set of weights. A single checkpoint handles both text-to-video (T2V) and image-to-video (I2V) generation, trained jointly without a learnable task-type embedding.
	- Up to 720p, 121 frames. The final model generates 720p video at 121 frames under the standard rectified flow-matching sampler.
	- Architectural specialization over brute-force scale. Three-stage backbone with role-separated dual-stream / single-stream / DDT decoder layers.
	- Shared Cross-Attention. Stabilizes text alignment under long video-token sequences by grounding cross-attention K/V in the self-attention manifold.
	- Micro-budget recipe. TREAD token routing (≈27% per-step FLOP reduction) + early-phase REPA with V-JEPA teacher + offline bucket-balanced sampler (≈90% data utilization, up from ≈20% baseline).
	- Open and reproducible. Trained on ~64×H200 GPUs with FSDP2, full curriculum and recipe documented in the technical report.

	---

	## 🏗️ Architecture

	Motif-Video 2B is a flow-matching diffusion transformer organized around a single principle: each component is assigned a well-defined responsibility, and components with conflicting objectives are not asked to share capacity.

	| Component | Choice |
	|---|---|
	| Text encoder | T5Gemma2 (encoder–decoder, UL2-adapted Gemma 3) |
	| Video tokenizer | Wan2.1 VAE (8×8 spatial, 4× temporal compression), 2×2×1 patchify |
	| Backbone | 12 dual-stream + 16 single-stream + 8 DDT decoder layers |
	| Hidden dim / heads | 1536 / 12 heads × 128 |
	| Normalization | QK-normalization throughout |
	| Position encoding | RoPE |
	| Cross-attention | Shared Cross-Attention in the single-stream stage |
	| Objective | Rectified flow matching (velocity prediction) |
	| I2V conditioning | First-frame latent + SigLIP image embeddings, with timestep-aware blur |
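
To get a feel for the sequence lengths these tokenizer settings imply, the sketch below estimates the video token count for a given clip size. It rests on two assumptions that the table does not state outright: that the causal VAE keeps the first frame and compresses the remaining ones (so latent frames = (frames − 1) / 4 + 1), and that "2×2×1 patchify" means 2×2 spatial patches with no temporal patching. The function name is hypothetical.

```python
def video_token_count(height: int, width: int, num_frames: int,
                      spatial: int = 8, temporal: int = 4, patch: int = 2):
    """Return (latent_frames, tokens_per_frame, total_tokens) for one clip."""
    stride = spatial * patch  # pixels covered by one token along each spatial axis
    if height % stride or width % stride:
        raise ValueError(f"height and width must be multiples of {stride}")
    if (num_frames - 1) % temporal:
        raise ValueError(f"num_frames must have the form {temporal}*k + 1")

    # Assumption: the first frame is kept, the rest are compressed `temporal`-fold.
    latent_frames = (num_frames - 1) // temporal + 1
    tokens_per_frame = (height // stride) * (width // stride)
    return latent_frames, tokens_per_frame, latent_frames * tokens_per_frame


if __name__ == "__main__":
    frames, per_frame, total = video_token_count(height=736, width=1280, num_frames=121)
    print(f"{frames} latent frames x {per_frame} tokens = {total} video tokens")
```

Under these assumptions, a 1280×736, 121-frame clip becomes 31 latent frames of 3,680 tokens each, or 114,080 video tokens in total.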

	A high-level walkthrough of the role separation:

	1. Dual-stream stage (12 layers). Text and video tokens are processed through separate self-attention pathways, exchanging information via cross-attention. This prevents premature feature entanglement before either modality has formed coherent representations.
	2. Single-stream stage (16 layers). Text and video tokens attend freely in a joint sequence. Shared Cross-Attention is attached here to repair the text-attention dilution that emerges as the video token sequence grows.
	3. DDT decoder (8 layers). A dedicated velocity decoder atop the 28-layer encoder, freeing the encoder from high-frequency detail reconstruction. Per-block attention analysis shows that the DDT decoder develops inter-frame attention structure that single-stream layers do not.
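
The layer bookkeeping in the walkthrough above can be written down as a small config object. This is only a sketch for checking the arithmetic (28 encoder layers, 36 layers in total, 12 × 128 = 1536 hidden dimensions); the class and method names are hypothetical and do not correspond to anything in the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackboneLayout:
    """Layer counts and attention shape from the component table (hypothetical class)."""

    dual_stream_layers: int = 12
    single_stream_layers: int = 16
    ddt_decoder_layers: int = 8
    num_heads: int = 12
    head_dim: int = 128

    @property
    def hidden_dim(self) -> int:
        return self.num_heads * self.head_dim

    @property
    def encoder_layers(self) -> int:
        # Dual-stream and single-stream stages together form the encoder.
        return self.dual_stream_layers + self.single_stream_layers

    @property
    def total_layers(self) -> int:
        return self.encoder_layers + self.ddt_decoder_layers

    def stage_of(self, layer_index: int) -> str:
        """Map a 0-based layer index to the stage that owns it."""
        if not 0 <= layer_index < self.total_layers:
            raise IndexError(f"layer index {layer_index} out of range")
        if layer_index < self.dual_stream_layers:
            return "dual-stream"
        if layer_index < self.encoder_layers:
            return "single-stream"
        return "ddt-decoder"
```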

	For the full derivation of why Shared Cross-Attention shares K/V but not Q, and why this is necessary in addition to standard zero-init of W_O, see Section 3.3 of the [technical report](https://arxiv.org/abs/2604.16503).

	<!--
	Optional: insert Figure 3 (attention heatmaps across the three stages)
	here as a secondary architecture figure. It is the strongest visual
	evidence for the role-separation argument.
	-->

	---

	## 🚀 Quickstart / Usage

	### Requirements

	- Python 3.10+
	- CUDA-capable GPU with 30GB+ VRAM (e.g., A100, H100); for 24GB GPUs see [Memory-efficient Inference](#🔋-memory-efficient-inference)

	```bash
	pip install "transformers>=5.5.4" torch accelerate ftfy einops sentencepiece regex Pillow imageio imageio-ffmpeg
	pip install git+https://github.com/waitingcheung/diffusers.git@feat/motif-video
	```

	### Text-to-Video (T2V)

	```python
	import torch
	from diffusers import AdaptiveProjectedGuidance, MotifVideoPipeline
	from diffusers.utils import export_to_video


	# Adaptive Projected Guidance (APG) with the recommended settings.
	guider = AdaptiveProjectedGuidance(
	    guidance_scale=8.0,
	    adaptive_projected_guidance_rescale=12.0,
	    adaptive_projected_guidance_momentum=0.1,
	    use_original_formulation=True,
	    normalization_dims="spatial",
	)

	pipe = MotifVideoPipeline.from_pretrained(
	    "Motif-Technologies/Motif-Video-2B",
	    torch_dtype=torch.bfloat16,
	    guider=guider,
	)
	pipe = pipe.to("cuda")

	output = pipe(
	    prompt="A woman standing in a sunlit field as flower petals swirl around her in slow motion. Each petal floats gently through the golden light, casting tiny shadows. Her hair moves like water, and time seems to stand still.",
	    negative_prompt="text overlay, graphic overlay, watermark, logo, subtitles, timestamp, broadcast graphics, UI elements, random letters, frozen pose, rigid, static expression, jerky motion, mechanical motion, discontinuous motion, flat framing, depthless, dull lighting, monotone, crushed shadows, blown-out highlights, shifting background, fading background, poor continuity, identity drift, deformation, flickering, ghosting, smearing, duplication, mutated proportions, inconsistent clothing, flat colors, desaturated, tonally compressed, poor background separation, exposure shift, uneven brightness, color balance shift",
	    height=736,
	    width=1280,
	    num_frames=121,
	    num_inference_steps=50,
	    frame_rate=24,
	    # Set explicitly, per Recommended Settings (assumed here to be a call-time argument).
	    use_linear_quadratic_schedule=False,
	)

	# Save the first generated clip at 24 fps.
	export_to_video(output.frames[0], "output.mp4", fps=24)
	```

	### Image-to-Video (I2V)

	```python
	import torch
	from diffusers import AdaptiveProjectedGuidance, MotifVideoImage2VideoPipeline
	from diffusers.utils import export_to_video, load_image

	# Adaptive Projected Guidance (APG) with the recommended settings.
	guider = AdaptiveProjectedGuidance(
	    guidance_scale=8.0,
	    adaptive_projected_guidance_rescale=12.0,
	    adaptive_projected_guidance_momentum=0.1,
	    use_original_formulation=True,
	    normalization_dims="spatial",
	)

	pipe = MotifVideoImage2VideoPipeline.from_pretrained(
	    "Motif-Technologies/Motif-Video-2B",
	    torch_dtype=torch.bfloat16,
	    guider=guider,
	)
	pipe = pipe.to("cuda")

	# Conditioning image used as the first frame.
	image = load_image("https://huggingface.co/Motif-Technologies/Motif-Video-2B/resolve/main/assets/i2v_sample.jpg")

	output = pipe(
	    prompt="Three friends stride through a sun-bleached meadow as a warm breeze ripples the tall dry grass around their legs.",
	    negative_prompt="text overlay, graphic overlay, watermark, logo, subtitles, timestamp, broadcast graphics, UI elements, random letters, frozen pose, rigid, static expression, jerky motion, mechanical motion, discontinuous motion, flat framing, depthless, dull lighting, monotone, crushed shadows, blown-out highlights, shifting background, fading background, poor continuity, identity drift, deformation, flickering, ghosting, smearing, duplication, mutated proportions, inconsistent clothing, flat colors, desaturated, tonally compressed, poor background separation, exposure shift, uneven brightness, color balance shift",
	    image=image,
	    height=736,
	    width=1280,
	    num_frames=121,
	    num_inference_steps=50,
	    frame_rate=24,
	    # Set explicitly, per Recommended Settings (assumed here to be a call-time argument).
	    use_linear_quadratic_schedule=False,
	)

	# Save the first generated clip at 24 fps.
	export_to_video(output.frames[0], "output.mp4", fps=24)
	```

	### CLI Inference

	```bash
	# Text-to-Video (default settings)
	python inference.py \
	    --prompt "A woman standing in a sunlit field as..." \
	    --output t2v_output.mp4

	# With SageAttention (~2x faster, requires sageattention package)
	python inference.py \
	    --prompt "Three friends stride through a sun-bleached meadow..." \
	    --use-sage-attention \
	    --output t2v_output.mp4
	```

	See `inference.py --help` for all available options.

	### Recommended Settings

	| Parameter | Default | Notes |
	|---|---|---|
	| Resolution | 1280×736 | 720p, best quality |
	| Frames | 121 | ~5 seconds at 24fps |
	| Scheduler | DPMSolver++ | `solver_order=2`, `flow_shift=15.0` |
	| Guidance scale | 8.0 | With APG (`normalization_dims="spatial"`) |
	| Inference steps | 50 | |
	| Negative prompt | (built-in) | See code examples above |
	| `use_linear_quadratic_schedule` | `False` | Must be set explicitly |
	| dtype | bfloat16 | Recommended for H100/A100 |
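
The `flow_shift=15.0` setting can be read as a warp of the noise schedule toward high noise levels. The sketch below assumes the scheduler applies the time-shift formula commonly used by flow-matching schedulers in diffusers, σ' = s·σ / (1 + (s − 1)·σ); this README does not spell the formula out, so treat it as an assumption. The function names and the linear base schedule are illustrative.

```python
def shift_sigma(sigma: float, flow_shift: float) -> float:
    """Warp a noise level in [0, 1]; flow_shift > 1 pushes it toward 1 (more noise)."""
    return flow_shift * sigma / (1.0 + (flow_shift - 1.0) * sigma)


def shifted_schedule(num_steps: int, flow_shift: float) -> list:
    """Apply the shift to a linear schedule running from 1.0 down to 1/num_steps."""
    base = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift_sigma(s, flow_shift) for s in base]


if __name__ == "__main__":
    sigmas = shifted_schedule(num_steps=50, flow_shift=15.0)
    # Print a few entries to show how slowly the shifted schedule leaves the high-noise region.
    for step in (0, 10, 25, 40, 49):
        print(f"step {step:>2}: sigma = {sigmas[step]:.4f}")
```

Under this formula, the midpoint σ = 0.5 maps to 0.9375 at `flow_shift=15.0`, so most of the 50 steps are spent at high noise levels, where global layout and motion are decided.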

	### 🔋 Memory-efficient Inference

	For GPUs with 24 GB or less (e.g. RTX 4090, RTX 3090), CPU offloading and FP8 quantization can reduce peak VRAM from ~30 GB to ~15 GB with minimal speed impact.

	| Mode | Peak VRAM | Recommended GPU |
	|------|-----------|-----------------|
	| `pipe.to("cuda")` | ~30 GB | A100, H100, H200 |
	| `enable_model_cpu_offload()` | ~19 GB | RTX 4090, RTX 3090 |
	| `+ FP8 quantization` | ~15 GB | RTX 4090, RTX 3090 |

	> Full guide → [docs/memory-efficient-inference.md](docs/memory-efficient-inference.md)
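
As a rough way to read the table, the helper below picks the first mode whose approximate peak VRAM fits on a given card. It is a hypothetical convenience function, not part of the pipeline; the peak figures are copied from the table, and the 1 GB headroom default is an arbitrary assumption.

```python
from typing import Optional

# (mode, approximate peak VRAM in GB), least intrusive first, copied from the table above.
MODES = [
    ('pipe.to("cuda")', 30.0),
    ("enable_model_cpu_offload()", 19.0),
    ("enable_model_cpu_offload() + FP8 quantization", 15.0),
]


def pick_mode(vram_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the first mode that fits in `vram_gb` minus headroom, or None if none fits."""
    usable = vram_gb - headroom_gb
    for mode, peak_gb in MODES:
        if usable >= peak_gb:
            return mode
    return None


if __name__ == "__main__":
    for card, vram in (("H100", 80), ("RTX 4090", 24), ("16 GB card", 16), ("12 GB card", 12)):
        print(f"{card}: {pick_mode(vram)}")
```

A `None` result means none of the three modes in the table is expected to fit; the GGUF variants may be worth checking in that case.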

	---

	### 🧊 GGUF + SageAttention

	GGUF quantized weights are available at [Motif-Video-2B-GGUF](https://huggingface.co/Motif-Technologies/Motif-Video-2B-GGUF), with up to 2.7 GB VRAM savings and no speed penalty. Combined with [SageAttention](https://github.com/thu-ml/SageAttention), inference runs ~1.6× faster.

	| Variant | Sage (s/it) | Speedup | Peak alloc (GB) |
	|---------|------------|---------|-----------------|
	| BF16 | 14.75 | 1.58x | 15.12 |
	| Q8_0 | 14.49 | 1.60x | 13.44 |
	| Q4_K_M | 14.59 | 1.60x | 12.53 |

	> Full guide → [docs/gguf-sageattention.md](docs/gguf-sageattention.md)
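
The table can be turned into a few derived figures with simple arithmetic. The sketch below assumes "Speedup" means baseline seconds per iteration divided by SageAttention seconds per iteration for the same variant, and it counts denoising steps only (no text encoding or VAE decoding). The derived numbers are back-of-the-envelope estimates from rounded table entries, not additional measurements.

```python
# variant -> (Sage s/it, speedup over the non-Sage run, peak allocated GB), from the table.
BENCH = {
    "BF16": (14.75, 1.58, 15.12),
    "Q8_0": (14.49, 1.60, 13.44),
    "Q4_K_M": (14.59, 1.60, 12.53),
}


def implied_baseline(variant: str) -> float:
    """Seconds per iteration without SageAttention, implied by the speedup column."""
    sage_s_per_it, speedup, _ = BENCH[variant]
    return sage_s_per_it * speedup


def vram_saving(variant: str, reference: str = "BF16") -> float:
    """Peak-allocation saving in GB relative to the reference variant."""
    return BENCH[reference][2] - BENCH[variant][2]


def denoise_minutes(variant: str, steps: int = 50) -> float:
    """Denoising time in minutes with SageAttention for `steps` iterations."""
    return BENCH[variant][0] * steps / 60.0


if __name__ == "__main__":
    for name in BENCH:
        print(
            f"{name}: baseline ~{implied_baseline(name):.1f} s/it, "
            f"saves {vram_saving(name):.2f} GB vs BF16, "
            f"~{denoise_minutes(name):.1f} min for 50 steps"
        )
```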

	---

	### 🖥️ ComfyUI

	Official ComfyUI custom nodes: [ComfyUI-MotifVideo2B](https://github.com/MotifTechnologies/ComfyUI-MotifVideo2B)

	> Note: Currently requires High VRAM mode. GGUF quantized model loading in ComfyUI is in progress.

	---

	## 📊 Performance

	### VBench

	Motif-Video 2B achieves the highest Total Score among open-source models we evaluate.

	| Model | Params | Total | Quality | Semantic |
	|---|---|---|---|---|
	| Wan2.2-T2V (prompt-opt.) | A14B | 84.23 | 85.42 | 79.50 |
	| Motif-Video 2B (Ours) | 2B | 83.76 | 84.59 | 80.44 |
	| SANA-Video | 2B | 83.71 | 84.35 | 81.35 |
	| Wan2.1-T2V | 14B | 83.69 | 85.59 | 76.11 |
	| OpenSora 2.0 (T2I2V) | 11B | 83.60 | 84.40 | 80.30 |
	| Wan2.1-T2V | 1.3B | 83.31 | 85.23 | 75.65 |
	| HunyuanVideo | 13B | 83.24 | 85.09 | 75.82 |
	| CogVideoX1.5-5B (prompt-opt.) | 5B | 82.17 | 82.78 | 79.76 |
	| Step-Video-T2V | 30B | 81.83 | 84.46 | 71.28 |
	| LTX-Video | 2B | 80.00 | 82.30 | 70.79 |
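
The Total column appears to follow VBench's standard aggregation, a 4:1 weighted average of the Quality and Semantic scores. That weighting is not stated in this README, so the sketch below treats it as an assumption and recomputes Total from the other two columns; the row values are copied from the table, and the function name is illustrative.

```python
def vbench_total(quality: float, semantic: float,
                 w_quality: float = 4.0, w_semantic: float = 1.0) -> float:
    """Weighted average of Quality and Semantic (4:1 weighting assumed)."""
    return (w_quality * quality + w_semantic * semantic) / (w_quality + w_semantic)


# (model, reported Total, Quality, Semantic) for a few rows of the table above.
ROWS = [
    ("Motif-Video 2B", 83.76, 84.59, 80.44),
    ("Wan2.1-T2V 14B", 83.69, 85.59, 76.11),
    ("HunyuanVideo", 83.24, 85.09, 75.82),
    ("LTX-Video", 80.00, 82.30, 70.79),
]

if __name__ == "__main__":
    for model, reported, quality, semantic in ROWS:
        recomputed = vbench_total(quality, semantic)
        print(f"{model}: reported {reported:.2f}, recomputed {recomputed:.2f}")
```

Because Quality carries four times the weight of Semantic, a model with a higher Quality score can still trail on Total when its Semantic score is much lower, which is the pattern in the Wan2.1-T2V 14B row.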

	Notable per-dimension highlights for Motif-Video 2B (open-source):

	- Spatial Relationship: 83.02% — best among open-source models
	- Semantic Score: 80.44% — highest among open-source models reporting per-dimension results
	- Object Class: 92.93%, Multiple Objects: 77.29%, Imaging Quality: 70.50% — second-best in their categories

	The full 16-dimension breakdown is in Table 3 of the [technical report](https://arxiv.org/abs/2604.16503).

	> A note on VBench vs. perceptual quality. Motif-Video 2B leads on VBench Total Score, but in our internal side-by-side comparisons against Wan2.1-T2V-14B we observe a perceptual gap in favor of the larger model on temporal stability and fine human anatomy. We discuss the sources of this gap (uniform dimension weighting, near-correct semantic credit) in Section 7 of the report. We report the gap explicitly rather than smoothing it over.

	### Human evaluation

	In a blind pairwise study against six contemporaneous open-source baselines (SANA-Video, LTX-Video 2, Wan2.1-14B, Wan2.1-1.3B, Wan2.2-5B, CogVideoX-5B) on 40 LLM-generated prompts, Motif-Video 2B is preferred over both SANA-Video (similar parameter count) and Wan2.1-1.3B (similar parameter count, larger training corpus) on prompt-following and video-fidelity axes. Wan2.1-14B remains the preferred model overall, consistent with its 7× larger parameter count and substantially larger training data.

	---

	## 🎬 Showcase

	<!--
	Insert the qualitative grids from the technical report here:
	- Figure 1 / Figure 12: T2V multi-prompt frame strips
	- Figure 13: I2V example (input image + generated frames)
	Use full-width or 2-column layout, matching Wan2.1's "Showcase" section.
	-->

	### Text-to-Video

	<p align="center">
	<img src="assets/showcase_t2v.png" width="100%" alt="Motif-Video 2B T2V samples"/>
	</p>

	### Image-to-Video

	<p align="center">
	<img src="assets/showcase_i2v.png" width="100%" alt="Motif-Video 2B I2V samples"/>
	</p>

	---

	## ⚠️ Limitations

	We report limitations as the boundary conditions under which the design decisions in this report should be interpreted, not as caveats.

	- Micro-scale semantic distortion. Motif-Video 2B occasionally produces sub-object-level artifacts that leave the category label intact but break perceptual plausibility — distorted hands on close-up human subjects, degraded body structure under high-displacement motion, and attribute leakage between visually similar co-present subjects. We attribute these primarily to data coverage rather than backbone design.
	- Temporal failures. Three distinct modes that frame-level metrics do not surface: (i) physically implausible liquid / cloth / collision dynamics, (ii) coherence loss under high scene complexity (multi-agent crowds), and (iii) unintended mid-clip scene transitions in long sequences.
	- Recipe components are evaluated jointly, not in isolation. We do not present per-component ablations for Shared Cross-Attention, the DDT decoder, REPA phasing, or TREAD routing at full scale. Readers should interpret our results as evidence that the composed recipe works at 2B, not as a marginal-contribution claim about any single component.

	We view temporal stability and data coverage, not architectural depth, as the primary remaining ceilings on this model. Both are natural axes for a future iteration, and the current architecture is built to absorb improvements along them.

	---

	## 📚 Citation

	If you find Motif-Video 2B useful in your research, please cite:

	```bibtex
	@techreport{motifvideo2b2026,
	title = {Motif-Video 2B: Technical Report},
	author = {Motif Technologies},
	year = {2026},
	institution = {Motif Technologies},
	url = {https://arxiv.org/abs/2604.16503}
	}
	```

	---

	## 🙏 Acknowledgements

	We build on a number of excellent open-source projects, including the Wan2.1 VAE [Wan Team, 2025], T5Gemma / Gemma 3 [Google], TREAD [Krause et al., 2025], REPA with the V-JEPA family of visual encoders [Bardes et al.], DDT [Wang et al.], and the broader diffusers and Accelerate ecosystems. Compute was provisioned on Microsoft Azure and orchestrated with SkyPilot on Kubernetes.

	---

	## 📄 License

	<!-- TODO: confirm final license — apache-2.0 placeholder above. -->

	This model is released under the Apache 2.0 License. See `LICENSE` for details.