GGUF + SageAttention
See the main README for `FlowDPMSolver` and pipeline setup.
GGUF-quantized transformer weights are available at Motif-Video-2B-GGUF. They reduce VRAM use with minimal quality loss and can be combined with SageAttention for ~2× faster attention computation.
GGUF Inference
pip install gguf
import torch
from diffusers import (
    AdaptiveProjectedGuidance,
    DPMSolverMultistepScheduler,
    GGUFQuantizationConfig,
    MotifVideoPipeline,
    MotifVideoTransformer3DModel,
)
from diffusers.utils import export_to_video
from huggingface_hub import hf_hub_download

# Guidance configuration passed to the pipeline.
guider = AdaptiveProjectedGuidance(
    guidance_scale=8.0,
    adaptive_projected_guidance_rescale=12.0,
    adaptive_projected_guidance_momentum=0.1,
    use_original_formulation=True,
    normalization_dims="spatial",
)

# Download the quantized transformer checkpoint.
variant = "Q4_K_M"  # Options: Q4_0, Q4_1, Q4_K_M, Q5_0, Q5_1, Q5_K_M, Q6_K, Q8_0, BF16
ckpt_path = hf_hub_download(
    "Motif-Technologies/Motif-Video-2B-GGUF",
    filename=f"motifv-2b-dev-{variant}.gguf",
)

# Load the GGUF weights; computation runs in bfloat16.
transformer = MotifVideoTransformer3DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    config="Motif-Technologies/Motif-Video-2B",
    revision="diffusers-integration",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

# Build the pipeline around the quantized transformer.
pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    revision="diffusers-integration",
    torch_dtype=torch.bfloat16,
    guider=guider,
    transformer=transformer,
)

# FlowDPMSolver is not imported above: it comes from the main README
# and must be defined before this line.
pipe.scheduler = FlowDPMSolver(
    num_train_timesteps=pipe.scheduler.config.get("num_train_timesteps", 1000),
    algorithm_type="dpmsolver++",
    solver_order=2,
    prediction_type="flow_prediction",
    use_flow_sigmas=True,
    flow_shift=15.0,
)
pipe.enable_model_cpu_offload()

output = pipe(
    prompt="A woman standing in a sunlit field as flower petals swirl around her in slow motion. Each petal floats gently through the golden light, casting tiny shadows. Her hair moves like water, and time seems to stand still.",
    negative_prompt="text overlay, graphic overlay, watermark, logo, subtitles, timestamp, broadcast graphics, UI elements, random letters, frozen pose, rigid, static expression, jerky motion, mechanical motion, discontinuous motion, flat framing, depthless, dull lighting, monotone, crushed shadows, blown-out highlights, shifting background, fading background, poor continuity, identity drift, deformation, flickering, ghosting, smearing, duplication, mutated proportions, inconsistent clothing, flat colors, desaturated, tonally compressed, poor background separation, exposure shift, uneven brightness, color balance shift",
    height=736,
    width=1280,
    num_frames=121,
    num_inference_steps=50,
    frame_rate=24,
    use_linear_quadratic_schedule=False,
)
export_to_video(output.frames[0], "output.mp4", fps=24)
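Because the checkpoint filename is built from the variant string, a typo only surfaces as a failed download. A minimal sketch of validating the variant first, assuming the `motifv-2b-dev-{variant}.gguf` pattern and the option list from the example above (the `gguf_filename` helper is hypothetical, not part of any library):

```python
# Variants listed in the example above.
GGUF_VARIANTS = (
    "Q4_0", "Q4_1", "Q4_K_M", "Q5_0", "Q5_1", "Q5_K_M", "Q6_K", "Q8_0", "BF16",
)


def gguf_filename(variant: str) -> str:
    """Return the checkpoint filename for a variant (hypothetical helper).

    Follows the `motifv-2b-dev-{variant}.gguf` pattern used in the example.
    """
    if variant not in GGUF_VARIANTS:
        options = ", ".join(GGUF_VARIANTS)
        raise ValueError(f"Unknown variant {variant!r}; expected one of: {options}")
    return f"motifv-2b-dev-{variant}.gguf"


print(gguf_filename("Q4_K_M"))
```

The returned string can be passed as `filename=` to `hf_hub_download`.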
SageAttention (Optional, ~1.6× faster)
Same prompt and seed, 1280x736, 121 frames, 50 steps. Left = SDPA, Right = SageAttention.
SageAttention accelerates attention by quantizing Q/K to INT8 and V to FP8, reducing memory bandwidth. Works with all GGUF variants.
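To make the INT8 idea concrete, here is a toy sketch of symmetric INT8 quantization applied to a single query/key dot product. It is not SageAttention's kernel, which runs on GPU tensors and also handles V in FP8; the function names and the per-vector scaling are illustrative assumptions.

```python
def quantize_int8(values):
    """Symmetric per-vector INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def int8_dot(q, k):
    """Approximate dot(q, k) from INT8 codes plus one floating-point rescale."""
    q_codes, q_scale = quantize_int8(q)
    k_codes, k_scale = quantize_int8(k)
    int_dot = sum(a * b for a, b in zip(q_codes, k_codes))  # integer accumulation
    return int_dot * q_scale * k_scale


q = [0.12, -0.50, 0.33, 0.91]
k = [0.40, 0.05, -0.77, 0.20]
exact = sum(a * b for a, b in zip(q, k))
approx = int8_dot(q, k)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The integer codes take one byte each instead of two for BF16, which is where the memory-bandwidth saving comes from; the cost is a small rounding error in each score.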
Install (build from source; PyPI only has 1.x, and 2.x is required):
# Set TORCH_CUDA_ARCH_LIST to match your GPU:
# "8.0" for A100/A30
# "8.6" for RTX 3090/3080/A40
# "8.9" for RTX 4090/4080/4070 Ti/L40/L40S (Ada Lovelace)
# "9.0" for H100/H200
# "12.0" for RTX 5090/5080/5070 Ti (Blackwell)
TORCH_CUDA_ARCH_LIST="9.0" pip install git+https://github.com/thu-ml/SageAttention.git --no-build-isolation
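If you are unsure which value applies, PyTorch can report the compute capability of the installed GPU. A small sketch, assuming PyTorch with CUDA support is already installed (the `arch_list_value` helper is hypothetical):

```python
def arch_list_value(major: int, minor: int) -> str:
    """Format a CUDA compute capability as a TORCH_CUDA_ARCH_LIST value."""
    return f"{major}.{minor}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        # get_device_capability returns a (major, minor) tuple, e.g. (8, 9).
        major, minor = torch.cuda.get_device_capability(0)
        print(f'TORCH_CUDA_ARCH_LIST="{arch_list_value(major, minor)}"')
    else:
        print("No CUDA device visible; set TORCH_CUDA_ARCH_LIST manually.")
```

The printed assignment can be placed in front of the `pip install` command above.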
Usage with inference.py:
python inference.py --use-sage-attention --prompt "..."
Notes:
- Requires an NVIDIA GPU with SM70 or newer
- SM90+ (H100, H200): FP8 kernels for maximum speedup
- SM80-SM89 (A100, RTX 3090, RTX 4090): FP16 kernels (still faster than SDPA)
- SM70-SM75 (V100, RTX 2080 Ti): FP16 kernels
- Set `TORCH_CUDA_ARCH_LIST` to match your GPU when building (e.g., `"8.6"` for RTX 3090, `"8.9"` for RTX 4090)
- No quality degradation observed across all GGUF variants
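The SM tiers in these notes can be written as a small lookup. This sketch only mirrors the notes; it is not SageAttention's own dispatch logic, and `sage_kernel_tier` is a hypothetical name:

```python
def sage_kernel_tier(major: int, minor: int) -> str:
    """Map a compute capability to the kernel family named in the notes."""
    sm = major * 10 + minor
    if sm < 70:
        return "unsupported"  # below the SM70 minimum
    if sm >= 90:
        return "fp8"          # SM90+ (H100, H200)
    return "fp16"             # SM70 through SM89


for name, cap in [("H200", (9, 0)), ("RTX 4090", (8, 9)), ("V100", (7, 0))]:
    print(f"{name}: {sage_kernel_tier(*cap)}")
```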
Benchmark (H200)
Measured on NVIDIA H200, 1280x736, 121 frames, 50 steps, DPMSolver++ (order=2, flow_shift=15.0):
| Variant | SDPA (s/it) | Sage (s/it) | Speedup | Peak alloc (GB) | Peak rsv (GB) | Total SDPA (s) | Total Sage (s) |
|---|---|---|---|---|---|---|---|
| BF16 | 23.36 | 14.75 | 1.58x | 14.78 / 15.12 | 24.93 / 24.90 | 1184 | 754 |
| Q8_0 | 23.16 | 14.49 | 1.60x | 13.10 / 13.44 | 23.14 / 23.11 | 1178 | 744 |
| Q6_K | 23.21 | 14.55 | 1.60x | 12.62 / 12.95 | 22.72 / 22.69 | 1178 | 747 |
| Q5_K_M | 23.33 | 14.69 | 1.59x | 12.39 / 12.72 | 22.45 / 22.42 | 1184 | 754 |
| Q5_1 | 23.54 | 14.96 | 1.57x | 12.47 / 12.81 | 22.66 / 22.62 | 1193 | 764 |
| Q5_0 | 23.26 | 14.67 | 1.59x | 12.37 / 12.71 | 22.55 / 22.52 | 1179 | 750 |
| Q4_K_M | 23.25 | 14.59 | 1.60x | 12.19 / 12.53 | 22.22 / 22.18 | 1178 | 747 |
| Q4_1 | 23.31 | 14.68 | 1.59x | 12.26 / 12.60 | 22.26 / 22.22 | 1181 | 750 |
| Q4_0 | 23.33 | 14.75 | 1.58x | 12.14 / 12.47 | 22.18 / 22.14 | 1188 | 760 |
Peak alloc/rsv columns show SDPA / Sage values. Sage adds ~0.3 GB alloc overhead (INT8/FP8 quantization buffers) with no change in reserved memory.
Key findings:
- ~1.59x faster with SageAttention: consistent across all quantization levels
- VRAM unchanged: Sage overhead is negligible (~0.3 GB alloc)
- GGUF + Sage stacks: Q4_K_M + Sage achieves 14.59 s/it at 12.53 GB alloc (vs BF16 SDPA: 23.36 s/it at 14.78 GB)
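The Speedup column is the ratio of the two s/it columns. The following sketch recomputes it from the H200 table values and averages it across variants, as a way to check the ~1.59x figure:

```python
# (variant, SDPA s/it, Sage s/it) copied from the H200 table.
H200_ROWS = [
    ("BF16", 23.36, 14.75),
    ("Q8_0", 23.16, 14.49),
    ("Q6_K", 23.21, 14.55),
    ("Q5_K_M", 23.33, 14.69),
    ("Q5_1", 23.54, 14.96),
    ("Q5_0", 23.26, 14.67),
    ("Q4_K_M", 23.25, 14.59),
    ("Q4_1", 23.31, 14.68),
    ("Q4_0", 23.33, 14.75),
]


def speedup(sdpa_s_per_it: float, sage_s_per_it: float) -> float:
    """Per-iteration speedup of SageAttention over SDPA."""
    return sdpa_s_per_it / sage_s_per_it


for name, sdpa, sage in H200_ROWS:
    print(f"{name:7s} {speedup(sdpa, sage):.2f}x")

mean_speedup = sum(speedup(s, g) for _, s, g in H200_ROWS) / len(H200_ROWS)
print(f"mean    {mean_speedup:.2f}x")
```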
Benchmark (RTX 4090)
Measured on NVIDIA RTX 4090 (24 GB), 1280x736, 121 frames, 50 steps, DPMSolver++ (order=2, flow_shift=15.0):
Environment: NGC nvcr.io/nvidia/pytorch:26.01-py3, Python 3.12.3, PyTorch 2.11.0+cu130, CUDA 13.0.
SageAttention built from source with TORCH_CUDA_ARCH_LIST="8.9".
| Variant | SDPA (s/it) | Sage (s/it) | Speedup | Peak alloc (GB) | Total SDPA (s) | Total Sage (s) |
|---|---|---|---|---|---|---|
| BF16 | 92.54 | 29.17 | 3.17x | 14.73 | 4665 | 1492 |
| Q8_0 | 92.51 | 29.18 | 3.17x | 13.02 | 4658 | 1493 |
| Q6_K | 92.81 | 29.41 | 3.16x | 12.58 | 4673 | 1504 |
| Q5_K_M | 92.79 | 29.43 | 3.15x | 12.36 | 4672 | 1505 |
| Q5_1 | 92.67 | 29.34 | 3.16x | 12.45 | 4667 | 1501 |
| Q5_0 | 92.64 | 29.34 | 3.16x | 12.34 | 4664 | 1500 |
| Q4_K_M | 92.62 | 29.29 | 3.16x | 12.16 | 4665 | 1502 |
| Q4_1 | 92.60 | 29.32 | 3.16x | 12.22 | 4668 | 1499 |
| Q4_0 | 92.64 | 29.32 | 3.16x | 12.11 | 4684 | 1500 |
Peak alloc is identical for SDPA/Sage (SageAttention adds no extra alloc overhead on RTX 4090). Peak reserved is ~14 GB with SDPA and ~16 GB with Sage.
Key findings (RTX 4090):
- ~3.16x faster with SageAttention: SM89 FP16 kernels deliver a larger relative speedup than the H200's FP8 kernels (3.16x vs 1.59x) because SDPA is slower on the 4090 while Sage remains fast
- All variants fit in 24 GB: Q4_0 + Sage peaks at 12.11 GB alloc (~16 GB reserved)
- GGUF + Sage stacks: Q4_K_M + Sage runs at 29.29 s/it at 12.16 GB (vs BF16 SDPA: 92.54 s/it at 14.73 GB)
- No quality degradation: identical to SDPA outputs across all variants
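The Total columns give the speedup for a whole run rather than per iteration. A short sketch using two rows of the RTX 4090 table, assuming the totals are wall-clock seconds for one complete generation:

```python
# (variant, Total SDPA s, Total Sage s) copied from the RTX 4090 table.
RTX4090_TOTALS = [
    ("BF16", 4665, 1492),
    ("Q4_K_M", 4665, 1502),
]


def end_to_end(total_sdpa_s: float, total_sage_s: float):
    """Return (speedup, minutes saved) for one full generation."""
    speedup = total_sdpa_s / total_sage_s
    minutes_saved = (total_sdpa_s - total_sage_s) / 60.0
    return speedup, minutes_saved


for name, sdpa_s, sage_s in RTX4090_TOTALS:
    ratio, saved = end_to_end(sdpa_s, sage_s)
    print(f"{name:7s} {ratio:.2f}x end to end, {saved:.1f} min saved per video")
```

The end-to-end ratio comes out slightly below the per-iteration 3.16x because the totals are larger than 50 times the s/it value, so they include time that SageAttention does not speed up.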



