
🧊 GGUF + SageAttention

See the main README for FlowDPMSolver and pipeline setup.

GGUF-quantized transformer weights are available at Motif-Video-2B-GGUF; they reduce VRAM use with minimal quality loss. They can be combined with SageAttention for ~2× faster attention computation.

GGUF Inference

pip install gguf
import torch
from diffusers import (
    AdaptiveProjectedGuidance,
    DPMSolverMultistepScheduler,
    GGUFQuantizationConfig,
    MotifVideoPipeline,
    MotifVideoTransformer3DModel,
)
from diffusers.utils import export_to_video
from huggingface_hub import hf_hub_download

guider = AdaptiveProjectedGuidance(
    guidance_scale=8.0,
    adaptive_projected_guidance_rescale=12.0,
    adaptive_projected_guidance_momentum=0.1,
    use_original_formulation=True,
    normalization_dims="spatial",
)

variant = "Q4_K_M"  # Options: Q4_0, Q4_1, Q4_K_M, Q5_0, Q5_1, Q5_K_M, Q6_K, Q8_0, BF16
ckpt_path = hf_hub_download(
    "Motif-Technologies/Motif-Video-2B-GGUF",
    filename=f"motifv-2b-dev-{variant}.gguf",
)

transformer = MotifVideoTransformer3DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    config="Motif-Technologies/Motif-Video-2B",
    revision="diffusers-integration",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    revision="diffusers-integration",
    torch_dtype=torch.bfloat16,
    guider=guider,
    transformer=transformer,
)

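# FlowDPMSolver is set up in the main README. The arguments below match diffusers'
# DPMSolverMultistepScheduler (imported above), so an alias is assumed here.
FlowDPMSolver = DPMSolverMultistepScheduler
pipe.scheduler = FlowDPMSolver(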
    num_train_timesteps=pipe.scheduler.config.get("num_train_timesteps", 1000),
    algorithm_type="dpmsolver++",
    solver_order=2,
    prediction_type="flow_prediction",
    use_flow_sigmas=True,
    flow_shift=15.0,
)
pipe.enable_model_cpu_offload()

output = pipe(
    prompt="A woman standing in a sunlit field as flower petals swirl around her in slow motion. Each petal floats gently through the golden light, casting tiny shadows. Her hair moves like water, and time seems to stand still.",
    negative_prompt="text overlay, graphic overlay, watermark, logo, subtitles, timestamp, broadcast graphics, UI elements, random letters, frozen pose, rigid, static expression, jerky motion, mechanical motion, discontinuous motion, flat framing, depthless, dull lighting, monotone, crushed shadows, blown-out highlights, shifting background, fading background, poor continuity, identity drift, deformation, flickering, ghosting, smearing, duplication, mutated proportions, inconsistent clothing, flat colors, desaturated, tonally compressed, poor background separation, exposure shift, uneven brightness, color balance shift",
    height=736,
    width=1280,
    num_frames=121,
    num_inference_steps=50,
    frame_rate=24,
    use_linear_quadratic_schedule=False,
)
export_to_video(output.frames[0], "output.mp4", fps=24)
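
The variant-to-filename mapping used above can be wrapped in a small helper that rejects typos before any download starts. This is a minimal sketch: `GGUF_VARIANTS` and `gguf_filename` are hypothetical names, and the variant list and filename pattern are taken from the snippet above.

```python
# Variants listed in the snippet above; the helper itself is a hypothetical convenience.
GGUF_VARIANTS = (
    "Q4_0", "Q4_1", "Q4_K_M", "Q5_0", "Q5_1", "Q5_K_M", "Q6_K", "Q8_0", "BF16",
)


def gguf_filename(variant: str) -> str:
    """Return the checkpoint filename for a variant, following the pattern used above."""
    if variant not in GGUF_VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {GGUF_VARIANTS}")
    return f"motifv-2b-dev-{variant}.gguf"


print(gguf_filename("Q4_K_M"))  # motifv-2b-dev-Q4_K_M.gguf
```

The returned string can be passed as `filename=` to `hf_hub_download` in place of the inline f-string.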

SageAttention (Optional, ~1.6× faster on H200, ~3.2× on RTX 4090)

Same prompt and seed, 1280x736, 121 frames, 50 steps. Left = SDPA, Right = SageAttention.

Comparison videos are provided for BF16, Q8_0, Q5_K_M, and Q4_K_M.

SageAttention accelerates attention by quantizing Q/K to INT8 and, on GPUs with FP8 kernels, V to FP8, which reduces memory bandwidth. It works with all GGUF variants.

Install (build from source: PyPI only has 1.x, and 2.x is required):

# Set TORCH_CUDA_ARCH_LIST to match your GPU:
#   "8.0" for A100/A30
#   "8.6" for RTX 3090/3080/A40
#   "8.9" for RTX 4090/4080/4070 Ti/L40/L40S (Ada Lovelace)
#   "9.0" for H100/H200
#   "12.0" for RTX 5090/5080/5070 Ti (Blackwell)
TORCH_CUDA_ARCH_LIST="9.0" pip install git+https://github.com/thu-ml/SageAttention.git --no-build-isolation
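
If the right value for your GPU is unclear, PyTorch can report the compute capability directly. The sketch below is one way to derive the string; `arch_string` and `detect_arch_list` are hypothetical helper names, and the only PyTorch calls used are `torch.cuda.is_available` and `torch.cuda.get_device_capability`.

```python
def arch_string(major: int, minor: int) -> str:
    """Format a CUDA compute capability as a TORCH_CUDA_ARCH_LIST entry."""
    return f"{major}.{minor}"


def detect_arch_list():
    """Return the arch string for GPU 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    return arch_string(major, minor)


# Prints e.g. "8.9" on an RTX 4090, or None on a machine without CUDA.
print(detect_arch_list())
```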

Usage with inference.py:

python inference.py --use-sage-attention --prompt "..."
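
The internals of `inference.py` are not shown here. As an illustration only, a common way to make an attention backend switchable is to wrap both kernels behind one callable and fall back to SDPA when the quantized kernel does not apply. In the sketch below, `make_attention` is a hypothetical helper, and the rule "use Sage only when no explicit mask is given" is an assumption, not the script's documented behavior.

```python
def make_attention(sdpa_fn, sage_fn=None):
    """Build an attention callable that prefers sage_fn and falls back to sdpa_fn."""

    def attention(q, k, v, attn_mask=None, is_causal=False):
        # Assumption: the quantized kernel is only used when no explicit mask is given.
        if sage_fn is not None and attn_mask is None:
            # "HND" = (batch, heads, seq_len, head_dim), the layout SDPA also uses.
            return sage_fn(q, k, v, tensor_layout="HND", is_causal=is_causal)
        return sdpa_fn(q, k, v, attn_mask=attn_mask, is_causal=is_causal)

    return attention


# With the real libraries installed, the two backends would be wired up as:
#   import torch.nn.functional as F
#   from sageattention import sageattn
#   attention = make_attention(F.scaled_dot_product_attention, sageattn)
```

Passing the kernels in as arguments keeps the wrapper importable on machines where SageAttention is not built.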

Notes:

  • Requires NVIDIA GPU with SM70+
  • SM90+ (H100, H200): FP8 kernels for maximum speedup
  • SM80-SM89 (A100, RTX 3090, RTX 4090): FP16 kernels (still faster than SDPA)
  • SM70-SM75 (V100, RTX 2080 Ti): FP16 kernels
  • Set TORCH_CUDA_ARCH_LIST to match your GPU when building (e.g., "8.6" for RTX 3090, "8.9" for RTX 4090)
  • No quality degradation observed across all GGUF variants
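
The SM tiers in the notes above can be written as a small lookup. This only restates the notes; it is not SageAttention's own dispatch logic, and `sage_kernel_tier` is a hypothetical name.

```python
def sage_kernel_tier(major: int, minor: int):
    """Map a compute capability to the kernel family listed in the notes above."""
    sm = major * 10 + minor
    if sm >= 90:
        return "fp8"   # SM90+ (H100, H200)
    if sm >= 70:
        return "fp16"  # SM70-SM89 (V100 through RTX 4090)
    return None        # below SM70: not supported per the notes


print(sage_kernel_tier(9, 0))  # fp8
print(sage_kernel_tier(8, 9))  # fp16
```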

Benchmark (H200)

Measured on NVIDIA H200, 1280x736, 121 frames, 50 steps, DPMSolver++ (order=2, flow_shift=15.0):

| Variant | SDPA (s/it) | Sage (s/it) | Speedup | Peak alloc, SDPA / Sage (GB) | Peak rsv, SDPA / Sage (GB) | Total SDPA (s) | Total Sage (s) |
|---|---|---|---|---|---|---|---|
| BF16 | 23.36 | 14.75 | 1.58x | 14.78 / 15.12 | 24.93 / 24.90 | 1184 | 754 |
| Q8_0 | 23.16 | 14.49 | 1.60x | 13.10 / 13.44 | 23.14 / 23.11 | 1178 | 744 |
| Q6_K | 23.21 | 14.55 | 1.60x | 12.62 / 12.95 | 22.72 / 22.69 | 1178 | 747 |
| Q5_K_M | 23.33 | 14.69 | 1.59x | 12.39 / 12.72 | 22.45 / 22.42 | 1184 | 754 |
| Q5_1 | 23.54 | 14.96 | 1.57x | 12.47 / 12.81 | 22.66 / 22.62 | 1193 | 764 |
| Q5_0 | 23.26 | 14.67 | 1.59x | 12.37 / 12.71 | 22.55 / 22.52 | 1179 | 750 |
| Q4_K_M | 23.25 | 14.59 | 1.60x | 12.19 / 12.53 | 22.22 / 22.18 | 1178 | 747 |
| Q4_1 | 23.31 | 14.68 | 1.59x | 12.26 / 12.60 | 22.26 / 22.22 | 1181 | 750 |
| Q4_0 | 23.33 | 14.75 | 1.58x | 12.14 / 12.47 | 22.18 / 22.14 | 1188 | 760 |

Peak alloc/rsv columns show SDPA / Sage values. Sage adds ~0.3 GB alloc overhead (INT8/FP8 quantization buffers) with no change in reserved memory.
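
For readers who want to reproduce the two memory columns, PyTorch exposes both counters. The sketch below assumes the figures correspond to `torch.cuda.max_memory_allocated` and `torch.cuda.max_memory_reserved` and that 1 GB means 1024³ bytes; neither is confirmed by the benchmark description. `measure_peak_memory` is a hypothetical helper.

```python
def measure_peak_memory(fn):
    """Run fn() and return (result, stats); stats is None without PyTorch and CUDA."""
    try:
        import torch
    except ImportError:
        return fn(), None
    if not torch.cuda.is_available():
        return fn(), None

    torch.cuda.reset_peak_memory_stats()  # start from a clean peak counter
    result = fn()
    torch.cuda.synchronize()              # wait for queued kernels to finish

    gib = 1024 ** 3  # assumption: the table's "GB" is 1024**3 bytes
    stats = {
        "peak_alloc_gb": torch.cuda.max_memory_allocated() / gib,
        "peak_rsv_gb": torch.cuda.max_memory_reserved() / gib,
    }
    return result, stats


# Usage: wrap the pipeline call, e.g. measure_peak_memory(lambda: pipe(prompt="..."))
```

Allocated memory counts live tensors, while reserved memory also includes blocks the caching allocator holds for reuse, which is why the reserved column is larger.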

Key findings:

  • ~1.59x faster with SageAttention: consistent across all quantization levels
  • VRAM unchanged: Sage overhead is negligible (~0.3 GB alloc)
  • GGUF + Sage stacks: Q4_K_M + Sage achieves 14.59 s/it at 12.53 GB alloc (vs BF16 SDPA: 23.36 s/it at 14.78 GB)
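
The headline figure can be checked from the per-step times in the H200 table. The short script below recomputes each speedup as SDPA s/it divided by Sage s/it and averages them; it also shows how much of the BF16 SDPA total is not accounted for by the 50 sampling steps. All numbers are copied from the table; the variable names are illustrative.

```python
# (SDPA s/it, Sage s/it) per variant, copied from the H200 table above.
H200 = {
    "BF16": (23.36, 14.75),
    "Q8_0": (23.16, 14.49),
    "Q6_K": (23.21, 14.55),
    "Q5_K_M": (23.33, 14.69),
    "Q5_1": (23.54, 14.96),
    "Q5_0": (23.26, 14.67),
    "Q4_K_M": (23.25, 14.59),
    "Q4_1": (23.31, 14.68),
    "Q4_0": (23.33, 14.75),
}

speedups = {name: sdpa / sage for name, (sdpa, sage) in H200.items()}
mean_speedup = sum(speedups.values()) / len(speedups)
print(f"mean speedup: {mean_speedup:.2f}x")  # mean speedup: 1.59x

# Time in the BF16 SDPA run beyond the 50 sampling steps (total 1184 s).
overhead_s = 1184 - 50 * 23.36
print(f"non-sampling time: {overhead_s:.0f} s")  # non-sampling time: 16 s
```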

Benchmark (RTX 4090)

Measured on NVIDIA RTX 4090 (24 GB), 1280x736, 121 frames, 50 steps, DPMSolver++ (order=2, flow_shift=15.0):

Environment: NGC nvcr.io/nvidia/pytorch:26.01-py3, Python 3.12.3, PyTorch 2.11.0+cu130, CUDA 13.0. SageAttention built from source with TORCH_CUDA_ARCH_LIST="8.9".

| Variant | SDPA (s/it) | Sage (s/it) | Speedup | Peak alloc (GB) | Total SDPA (s) | Total Sage (s) |
|---|---|---|---|---|---|---|
| BF16 | 92.54 | 29.17 | 3.17x | 14.73 | 4665 | 1492 |
| Q8_0 | 92.51 | 29.18 | 3.17x | 13.02 | 4658 | 1493 |
| Q6_K | 92.81 | 29.41 | 3.16x | 12.58 | 4673 | 1504 |
| Q5_K_M | 92.79 | 29.43 | 3.15x | 12.36 | 4672 | 1505 |
| Q5_1 | 92.67 | 29.34 | 3.16x | 12.45 | 4667 | 1501 |
| Q5_0 | 92.64 | 29.34 | 3.16x | 12.34 | 4664 | 1500 |
| Q4_K_M | 92.62 | 29.29 | 3.16x | 12.16 | 4665 | 1502 |
| Q4_1 | 92.60 | 29.32 | 3.16x | 12.22 | 4668 | 1499 |
| Q4_0 | 92.64 | 29.32 | 3.16x | 12.11 | 4684 | 1500 |

Peak alloc is identical for SDPA/Sage (SageAttention adds no extra alloc overhead on RTX 4090). Peak reserved is ~14 GB with SDPA and ~16 GB with Sage.

Key findings (RTX 4090):

  • ~3.16x faster with SageAttention: SM89 FP16 kernels deliver a larger relative speedup than H200's FP8 kernels (3.16x vs 1.59x). For BF16, SDPA is about 4x slower on the 4090 than on the H200 (92.54 vs 23.36 s/it), while Sage is only about 2x slower (29.17 vs 14.75 s/it)
  • All variants fit in 24 GB: Q4_0 + Sage peaks at 12.11 GB alloc (~16 GB reserved)
  • GGUF + Sage stacks: Q4_K_M + Sage: 29.29 s/it at 12.16 GB (vs BF16 SDPA: 92.54 s/it at 14.73 GB)
  • No quality degradation: identical to SDPA outputs across all variants
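
The difference in relative speedup between the two GPUs follows directly from the BF16 rows of the two tables. The arithmetic below uses only those four numbers; the variable names are illustrative.

```python
# BF16 seconds per iteration, copied from the H200 and RTX 4090 tables above.
s_per_it = {
    "H200": {"sdpa": 23.36, "sage": 14.75},
    "RTX 4090": {"sdpa": 92.54, "sage": 29.17},
}

# How much slower the RTX 4090 is than the H200 for each attention backend.
sdpa_gap = s_per_it["RTX 4090"]["sdpa"] / s_per_it["H200"]["sdpa"]
sage_gap = s_per_it["RTX 4090"]["sage"] / s_per_it["H200"]["sage"]

print(f"SDPA: {sdpa_gap:.2f}x slower on RTX 4090")  # SDPA: 3.96x slower on RTX 4090
print(f"Sage: {sage_gap:.2f}x slower on RTX 4090")  # Sage: 1.98x slower on RTX 4090
```

Because the SDPA gap is roughly twice the Sage gap, the Sage-over-SDPA speedup on the RTX 4090 is roughly twice the one on the H200.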