🧊 GGUF + SageAttention

See the main README for FlowDPMSolver and pipeline setup.

GGUF-quantized transformer weights are available at Motif-Video-2B-GGUF; they reduce VRAM use with minimal quality loss. Combine them with SageAttention for ~2× faster attention computation.

GGUF Inference

pip install gguf

import torch
from diffusers import (
    AdaptiveProjectedGuidance,
    DPMSolverMultistepScheduler,
    GGUFQuantizationConfig,
    MotifVideoPipeline,
    MotifVideoTransformer3DModel,
)
from diffusers.utils import export_to_video
from huggingface_hub import hf_hub_download

guider = AdaptiveProjectedGuidance(
    guidance_scale=8.0,
    adaptive_projected_guidance_rescale=12.0,
    adaptive_projected_guidance_momentum=0.1,
    use_original_formulation=True,
    normalization_dims="spatial",
)

variant = "Q4_K_M"  # Options: Q4_0, Q4_1, Q4_K_M, Q5_0, Q5_1, Q5_K_M, Q6_K, Q8_0, BF16
ckpt_path = hf_hub_download(
    "Motif-Technologies/Motif-Video-2B-GGUF",
    filename=f"motifv-2b-dev-{variant}.gguf",
)

transformer = MotifVideoTransformer3DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    config="Motif-Technologies/Motif-Video-2B",
    revision="diffusers-integration",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    revision="diffusers-integration",
    torch_dtype=torch.bfloat16,
    guider=guider,
    transformer=transformer,
)

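# FlowDPMSolver is not imported above: define it as shown in the main README before running this line.
pipe.scheduler = FlowDPMSolver(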
    num_train_timesteps=pipe.scheduler.config.get("num_train_timesteps", 1000),
    algorithm_type="dpmsolver++",
    solver_order=2,
    prediction_type="flow_prediction",
    use_flow_sigmas=True,
    flow_shift=15.0,
)
pipe.enable_model_cpu_offload()

output = pipe(
    prompt="A woman standing in a sunlit field as flower petals swirl around her in slow motion. Each petal floats gently through the golden light, casting tiny shadows. Her hair moves like water, and time seems to stand still.",
    negative_prompt="text overlay, graphic overlay, watermark, logo, subtitles, timestamp, broadcast graphics, UI elements, random letters, frozen pose, rigid, static expression, jerky motion, mechanical motion, discontinuous motion, flat framing, depthless, dull lighting, monotone, crushed shadows, blown-out highlights, shifting background, fading background, poor continuity, identity drift, deformation, flickering, ghosting, smearing, duplication, mutated proportions, inconsistent clothing, flat colors, desaturated, tonally compressed, poor background separation, exposure shift, uneven brightness, color balance shift",
    height=736,
    width=1280,
    num_frames=121,
    num_inference_steps=50,
    frame_rate=24,
    use_linear_quadratic_schedule=False,
)
export_to_video(output.frames[0], "output.mp4", fps=24)
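
Because the checkpoint filename is built from the variant string, a typo only shows up as a failed download. A minimal sketch that checks the variant against the options listed in the example above before building the filename; the `VARIANTS` tuple and the `gguf_filename` helper are illustrative names, not part of diffusers or huggingface_hub:

```python
# Variants listed in the example above; this tuple and helper are illustrative only.
VARIANTS = ("Q4_0", "Q4_1", "Q4_K_M", "Q5_0", "Q5_1", "Q5_K_M", "Q6_K", "Q8_0", "BF16")


def gguf_filename(variant: str) -> str:
    """Return the checkpoint filename for a variant, rejecting unknown names."""
    if variant not in VARIANTS:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {', '.join(VARIANTS)}"
        )
    # Same naming pattern as the hf_hub_download call above.
    return f"motifv-2b-dev-{variant}.gguf"


print(gguf_filename("Q4_K_M"))
```

The returned string can be passed as `filename=` to `hf_hub_download` in place of the f-string.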

SageAttention (Optional, ~1.6× faster)

Same prompt and seed, 1280x736, 121 frames, 50 steps. Left = SDPA, Right = SageAttention.

SageAttention accelerates attention by quantizing Q/K to INT8 and V to FP8, which reduces memory bandwidth. It works with all GGUF variants.

Install (build from source; PyPI only has 1.x, and 2.x is required):

# Set TORCH_CUDA_ARCH_LIST to match your GPU:
#   "8.0" for A100/A30
#   "8.6" for RTX 3090/3080/A40
#   "8.9" for RTX 4090/4080/4070 Ti/L40/L40S (Ada Lovelace)
#   "9.0" for H100/H200
#   "12.0" for RTX 5090/5080/5070 Ti (Blackwell)
TORCH_CUDA_ARCH_LIST="9.0" pip install git+https://github.com/thu-ml/SageAttention.git --no-build-isolation
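
If you are unsure which value applies, the compute capability can be read from PyTorch. A minimal sketch, assuming PyTorch with CUDA support is already installed; the helper names `arch_list_value` and `detect_arch_list_value` are illustrative:

```python
def arch_list_value(major: int, minor: int) -> str:
    """Format a compute capability pair for TORCH_CUDA_ARCH_LIST, e.g. (8, 9) -> '8.9'."""
    return f"{major}.{minor}"


def detect_arch_list_value(device: int = 0) -> str:
    """Read the compute capability of a local GPU via PyTorch."""
    import torch  # imported lazily so the formatter above works without PyTorch

    if not torch.cuda.is_available():
        raise RuntimeError("no CUDA device visible to PyTorch")
    return arch_list_value(*torch.cuda.get_device_capability(device))


try:
    print(f'TORCH_CUDA_ARCH_LIST="{detect_arch_list_value()}"')
except (ImportError, RuntimeError) as exc:
    print(f"could not detect a GPU: {exc}")
```

The printed assignment can be pasted in front of the `pip install` command above.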

Usage with inference.py:

python inference.py --use-sage-attention --prompt "..."

Notes:

Requires NVIDIA GPU with SM70+
SM90+ (H100, H200) — FP8 kernels for maximum speedup
SM80-SM89 (A100, RTX 3090, RTX 4090) — FP16 kernels (still faster than SDPA)
SM70-SM75 (V100, RTX 2080 Ti) — FP16 kernels
Set TORCH_CUDA_ARCH_LIST to match your GPU when building (e.g., "8.6" for RTX 3090, "8.9" for RTX 4090)
No quality degradation observed across all GGUF variants

Benchmark (H200)

Measured on NVIDIA H200, 1280x736, 121 frames, 50 steps, DPMSolver++ (order=2, flow_shift=15.0):

Variant	SDPA (s/it)	Sage (s/it)	Speedup	Peak alloc, SDPA / Sage (GB)	Peak rsv, SDPA / Sage (GB)	Total SDPA (s)	Total Sage (s)
BF16	23.36	14.75	1.58x	14.78 / 15.12	24.93 / 24.90	1184	754
Q8_0	23.16	14.49	1.60x	13.10 / 13.44	23.14 / 23.11	1178	744
Q6_K	23.21	14.55	1.60x	12.62 / 12.95	22.72 / 22.69	1178	747
Q5_K_M	23.33	14.69	1.59x	12.39 / 12.72	22.45 / 22.42	1184	754
Q5_1	23.54	14.96	1.57x	12.47 / 12.81	22.66 / 22.62	1193	764
Q5_0	23.26	14.67	1.59x	12.37 / 12.71	22.55 / 22.52	1179	750
Q4_K_M	23.25	14.59	1.60x	12.19 / 12.53	22.22 / 22.18	1178	747
Q4_1	23.31	14.68	1.59x	12.26 / 12.60	22.26 / 22.22	1181	750
Q4_0	23.33	14.75	1.58x	12.14 / 12.47	22.18 / 22.14	1188	760

Peak alloc/rsv columns show SDPA / Sage values. Sage adds ~0.3 GB alloc overhead (INT8/FP8 quantization buffers) with no change in reserved memory.
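
The ~0.3 GB figure can be checked directly from the table. A small sketch that subtracts the SDPA value from the Sage value for each variant; the numbers are copied from the peak alloc column above, and the variable names are illustrative:

```python
# (SDPA, Sage) peak allocated memory in GB, copied from the H200 table above.
peak_alloc = {
    "BF16": (14.78, 15.12),
    "Q8_0": (13.10, 13.44),
    "Q6_K": (12.62, 12.95),
    "Q5_K_M": (12.39, 12.72),
    "Q5_1": (12.47, 12.81),
    "Q5_0": (12.37, 12.71),
    "Q4_K_M": (12.19, 12.53),
    "Q4_1": (12.26, 12.60),
    "Q4_0": (12.14, 12.47),
}

# Extra allocated memory when SageAttention is enabled, per variant.
overhead = {v: round(sage - sdpa, 2) for v, (sdpa, sage) in peak_alloc.items()}

for variant, gb in overhead.items():
    print(f"{variant:<7} +{gb:.2f} GB")
```

Every variant lands at 0.33 or 0.34 GB, so the overhead does not depend on the quantization level.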

Key findings:

~1.59x faster with SageAttention — consistent across all quantization levels
VRAM nearly unchanged: Sage adds only ~0.3 GB of allocated memory and no reserved memory
GGUF + Sage stacks — Q4_K_M + Sage achieves 14.59 s/it at 12.53 GB alloc (vs BF16 SDPA: 23.36 s/it at 14.78 GB)
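
These findings can be reproduced from the table with a few lines of arithmetic. A sketch using the s/it and peak alloc values copied from the H200 table; the variable names are illustrative:

```python
# (SDPA s/it, Sage s/it) per variant, copied from the H200 table above.
h200 = {
    "BF16": (23.36, 14.75),
    "Q8_0": (23.16, 14.49),
    "Q6_K": (23.21, 14.55),
    "Q5_K_M": (23.33, 14.69),
    "Q5_1": (23.54, 14.96),
    "Q5_0": (23.26, 14.67),
    "Q4_K_M": (23.25, 14.59),
    "Q4_1": (23.31, 14.68),
    "Q4_0": (23.33, 14.75),
}

# Per-variant speedup is SDPA time divided by Sage time.
speedups = {v: sdpa / sage for v, (sdpa, sage) in h200.items()}
mean_speedup = sum(speedups.values()) / len(speedups)

# Stacking both optimizations: BF16 + SDPA baseline vs Q4_K_M + Sage.
stacked_speedup = h200["BF16"][0] / h200["Q4_K_M"][1]
alloc_saved = 1 - 12.53 / 14.78  # peak allocated GB, Q4_K_M + Sage vs BF16 + SDPA

print(f"mean speedup: {mean_speedup:.2f}x")
print(f"stacked: {stacked_speedup:.2f}x faster, {alloc_saved:.1%} less peak allocated memory")
```

Because the table values are already rounded, a recomputed per-row speedup can differ from the printed one in the last digit.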

Benchmark (RTX 4090)

Measured on NVIDIA RTX 4090 (24 GB), 1280x736, 121 frames, 50 steps, DPMSolver++ (order=2, flow_shift=15.0):

Environment: NGC nvcr.io/nvidia/pytorch:26.01-py3, Python 3.12.3, PyTorch 2.11.0+cu130, CUDA 13.0. SageAttention built from source with TORCH_CUDA_ARCH_LIST="8.9".

Variant	SDPA (s/it)	Sage (s/it)	Speedup	Peak alloc (GB)	Total SDPA (s)	Total Sage (s)
BF16	92.54	29.17	3.17x	14.73	4665	1492
Q8_0	92.51	29.18	3.17x	13.02	4658	1493
Q6_K	92.81	29.41	3.16x	12.58	4673	1504
Q5_K_M	92.79	29.43	3.15x	12.36	4672	1505
Q5_1	92.67	29.34	3.16x	12.45	4667	1501
Q5_0	92.64	29.34	3.16x	12.34	4664	1500
Q4_K_M	92.62	29.29	3.16x	12.16	4665	1502
Q4_1	92.60	29.32	3.16x	12.22	4668	1499
Q4_0	92.64	29.32	3.16x	12.11	4684	1500

Peak alloc is identical for SDPA/Sage (SageAttention adds no extra alloc overhead on RTX 4090). Peak reserved is ~14 GB with SDPA and ~16 GB with Sage.

Key findings (RTX 4090):

~3.16x faster with SageAttention: SM89 FP16 kernels deliver a larger relative speedup than the H200's FP8 kernels (3.16x vs 1.59x) because SDPA is about 4x slower on the RTX 4090 than on the H200 (~92.6 vs ~23.3 s/it), while Sage is only about 2x slower (~29.3 vs ~14.7 s/it)
All variants fit in 24 GB: peak alloc ranges from 12.11 GB (Q4_0) to 14.73 GB (BF16), with ~16 GB reserved when Sage is enabled
GGUF + Sage stacks — Q4_K_M + Sage: 29.29 s/it at 12.16 GB (vs BF16 SDPA: 92.54 s/it at 14.73 GB)
No quality degradation — identical to SDPA outputs across all variants
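
The difference in relative speedup between the two GPUs follows from how much each backend slows down when moving from the H200 to the RTX 4090. A short sketch using the BF16 rows of the two tables above; the variable names are illustrative:

```python
# BF16 s/it values copied from the H200 and RTX 4090 tables above.
h200 = {"sdpa": 23.36, "sage": 14.75}
rtx4090 = {"sdpa": 92.54, "sage": 29.17}

# How much slower each attention backend is on the RTX 4090 than on the H200.
sdpa_gap = rtx4090["sdpa"] / h200["sdpa"]
sage_gap = rtx4090["sage"] / h200["sage"]

# Ratio of the per-GPU speedups; algebraically equal to sdpa_gap / sage_gap.
speedup_ratio = (rtx4090["sdpa"] / rtx4090["sage"]) / (h200["sdpa"] / h200["sage"])

print(f"SDPA is {sdpa_gap:.1f}x slower on the RTX 4090, Sage is {sage_gap:.1f}x slower")
print(f"relative speedup is {speedup_ratio:.1f}x larger on the RTX 4090")
```

SDPA loses roughly twice as much as Sage when moving to the RTX 4090, which is why the Sage speedup there is about double the H200 figure.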