Accelerating Masked Image Generation by Learning Latent Controlled Dynamics
Abstract
MIGM-Shortcut accelerates masked image generation by learning a lightweight model that predicts the velocity of feature evolution, achieving an over-4x speedup while maintaining quality.
Masked Image Generation Models (MIGMs) have achieved great success, yet their efficiency is hampered by the many bidirectional-attention steps they require. In fact, their computation contains notable redundancy: when discrete tokens are sampled, the rich semantics contained in the continuous features are lost. Some existing works cache features to approximate future ones; however, they exhibit considerable approximation error under aggressive acceleration rates. We attribute this to their limited expressivity and their failure to account for sampling information. To fill this gap, we propose to learn a lightweight model that incorporates both previous features and sampled tokens and regresses the average velocity field of the feature evolution. The model has moderate complexity, sufficient to capture the subtle dynamics while remaining lightweight compared to the original base model. We apply our method, MIGM-Shortcut, to two representative MIGM architectures and tasks. In particular, on the state-of-the-art Lumina-DiMOO, it achieves over 4x acceleration of text-to-image generation while maintaining quality, significantly pushing the Pareto frontier of masked image generation. The code and model weights are available at https://github.com/Kaiwen-Zhu/MIGM-Shortcut.
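The core idea in the abstract, extrapolating features along a learned average velocity conditioned on both previous features and sampled tokens, can be sketched in a few lines of plain Python. Everything below (the per-position linear predictor, the weight format, the step size `dt`) is a toy stand-in chosen for illustration, not the paper's actual architecture:

```python
# Hypothetical sketch of the MIGM-Shortcut idea: a lightweight head
# predicts the *average velocity* of feature evolution between step t
# and step t+dt, so intermediate backbone forward passes can be skipped.
# The linear predictor and its weights are illustrative assumptions.

def predict_avg_velocity(feats, token_ids, weights):
    """Toy per-position predictor: v = w_f * feat + embed[token]."""
    w_f, embed = weights
    return [w_f * f + embed[tok] for f, tok in zip(feats, token_ids)]

def shortcut_step(feats, token_ids, weights, dt):
    """Extrapolate features along the predicted average velocity:
    feat_{t+dt} ~= feat_t + dt * v_avg, replacing dt backbone steps."""
    v = predict_avg_velocity(feats, token_ids, weights)
    return [f + dt * vi for f, vi in zip(feats, v)]

# Usage: scalar "features" at 3 token positions, jumping 2 steps at once.
weights = (0.5, {0: 0.1, 1: -0.2})
print(shortcut_step([1.0, 2.0, 3.0], [0, 1, 0], weights, dt=2.0))
```

In the paper's setting the predictor would be a small learned network and the features high-dimensional, but the control flow is the same: one cheap velocity prediction replaces several expensive bidirectional-attention passes.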
Community
arXiv:2602.23996
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Predict to Skip: Linear Multistep Feature Forecasting for Efficient Diffusion Transformers (2026)
- Autoregressive Image Generation with Masked Bit Modeling (2026)
- Elastic Diffusion Transformer (2026)
- LINA: Linear Autoregressive Image Generative Models with Continuous Tokens (2026)
- Latent Forcing: Reordering the Diffusion Trajectory for Pixel-Space Image Generation (2026)
- CHAI: CacHe Attention Inference for text2video (2026)
- Annealed Relaxation of Speculative Decoding for Faster Autoregressive Image Generation (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Hi, I'm curious whether the dynamics you consider could benefit from being modelled with a Kalman filter, as shown here:
https://ieeexplore.ieee.org/document/10635585