arXiv:2601.20332

Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching

Published on Jan 28

AI-generated summary

Diffusion language models use iterative denoising for text generation, but inference involves redundant computation on masked tokens; a window-based approach prunes and caches tokens to accelerate inference while maintaining performance.

Abstract

Diffusion language models (DLMs) generate text through iterative denoising, but inference requires full-sequence attention at every iteration, resulting in substantial redundant computation on masked tokens. Block-wise diffusion can reduce this cost, yet it typically relies on retraining and constrained update orders, limiting its direct applicability to pretrained DLMs. Our token-level analysis reveals pronounced structural locality in DLM inference. Decoding is driven by a small set of prefix-localized active tokens; the influence of distant undecoded context diminishes rapidly, and decoded tokens exhibit stage-wise temporal stability, enabling reuse of intermediate representations except for a brief post-decode transient. Motivated by these observations, we propose Window-Diffusion, a window-based token pruning and caching method for inference. We maintain a local computation window that slides rightward as denoising progresses and partition undecoded tokens into: (i) active tokens that are computed online, (ii) buffer tokens whose KV states are cached and periodically refreshed, and (iii) far-field tokens that are pruned outside the window. Computation is restricted to active and buffer tokens within the window, while far-field tokens are omitted at each stage. Experiments on LLaDA and Dream show that, under matched compute budgets, our method achieves up to 99× inference speedup while largely preserving generation performance. The source code is available at https://github.com/vhicrgit/Window-Diffusion.
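To make the three-way split concrete, the following is a minimal Python sketch of the window scheduling as the abstract describes it. Every name here (partition_tokens, N_ACTIVE, N_BUFFER, REFRESH_EVERY), the window sizes, and the toy leftmost-first decode rule are illustrative assumptions, not the authors' implementation; the real method is in the linked repository.

```python
# Sketch of the windowed active / buffer / far-field partition.
# All names and sizes are assumptions for exposition only.

def partition_tokens(seq_len, decoded, window_start, n_active, n_buffer):
    """Split undecoded positions into active / buffer / far-field groups."""
    active, buffer, far_field = [], [], []
    for pos in range(seq_len):
        if pos in decoded:
            continue  # decoded tokens reuse cached representations
        if pos < window_start + n_active:
            active.append(pos)       # recomputed online every iteration
        elif pos < window_start + n_active + n_buffer:
            buffer.append(pos)       # KV cached, refreshed periodically
        else:
            far_field.append(pos)    # pruned: omitted at this stage
    return active, buffer, far_field


SEQ_LEN, N_ACTIVE, N_BUFFER, REFRESH_EVERY = 32, 4, 8, 4
decoded, kv_cache, step = set(), {}, 0

while len(decoded) < SEQ_LEN:
    # The window slides rightward: it starts at the leftmost undecoded token.
    window_start = min(p for p in range(SEQ_LEN) if p not in decoded)
    active, buffer, far_field = partition_tokens(
        SEQ_LEN, decoded, window_start, N_ACTIVE, N_BUFFER
    )

    # Computation is restricted to the window: active tokens always
    # recompute; buffer tokens recompute only on a periodic refresh;
    # far-field tokens are skipped entirely this iteration.
    to_compute = list(active)
    if step % REFRESH_EVERY == 0:
        to_compute += buffer
    for pos in to_compute:
        kv_cache[pos] = f"kv@step{step}"  # stand-in for real K/V tensors

    # Toy decode rule: commit the leftmost active token, mimicking the
    # prefix-localized decoding the paper's analysis reports.
    decoded.add(active[0])
    step += 1

print(f"decoded {len(decoded)} tokens in {step} denoising steps")
```

A real implementation would run attention only over the active and buffer positions and refresh buffer KV states on the paper's schedule; the sketch shows just how the partition evolves as the window slides.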
