Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs
Abstract
Diffusion language models exhibit distinct representational structures compared to autoregressive models, with hierarchical abstractions and reduced bias, enabling efficient layer-skipping inference without architectural modifications.
Autoregressive (AR) language models form representations incrementally through left-to-right prediction, whereas diffusion language models (dLLMs) are trained via full-sequence denoising. Although recent dLLMs match AR performance, it remains unclear whether diffusion objectives fundamentally reshape internal representations across depth. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B). We find that diffusion objectives result in different, more hierarchical abstractions with substantial early-layer redundancy and reduced recency bias, while AR objectives produce tightly coupled, depth-dependent representations. Critically, AR-initialized dLLMs retain AR-like representational dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this observed representational redundancy, we introduce a static, task-agnostic inference-time layer-skipping method requiring no architectural changes or KV-cache sharing. Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% performance on reasoning and code generation benchmarks, whereas AR models degrade sharply under comparable skipping. These results link training objectives to representational structure and enable practical, cache-orthogonal efficiency gains.
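The abstract describes a static, task-agnostic layer-skipping scheme: a fixed set of (redundant early) layers is simply bypassed at inference, with no architectural changes. The minimal sketch below, using a hypothetical toy residual stack in NumPy (not the authors' code; layer count, hidden size, and the skipped indices are illustrative assumptions), shows the core mechanic: skipped layers act as identities, so compute falls in proportion to the fraction of layers skipped.

```python
import numpy as np

# Hypothetical toy "transformer" stack: each layer is a residual block.
# Static layer skipping bypasses a fixed set of layer indices at inference;
# this is an illustrative sketch, not the paper's implementation.
rng = np.random.default_rng(0)
D, L = 16, 8  # hidden size and layer count (illustrative values)
weights = [rng.standard_normal((D, D)) * 0.05 for _ in range(L)]

def block(h, W):
    # Residual update; tanh stands in for the block's nonlinearity.
    return h + np.tanh(h @ W)

def forward(h, skip=frozenset()):
    """Run the stack, statically skipping the layer indices in `skip`."""
    for i, W in enumerate(weights):
        if i in skip:
            continue  # skipped layer behaves as identity, saving its FLOPs
        h = block(h, W)
    return h

x = rng.standard_normal((1, D))
full = forward(x)
# Skip two early layers, as a redundancy analysis might suggest:
# FLOPs saved scale with the skipped fraction (2/8 = 25% here).
skipped = forward(x, skip={1, 2})
rel_err = np.linalg.norm(full - skipped) / np.linalg.norm(full)
```

The paper's finding is that for native dLLMs this `rel_err` analogue stays small enough to preserve over 90% of task performance, while AR models degrade sharply under the same static skipping.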
Community
First effort towards analyzing the internal representations of a native dLLM (LLaDA) and a dLLM initialized from an autoregressive (AR) model (Dream, initialized from Qwen2.5-7B). The native dLLM appears to learn more abstraction in its early layers, which can be exploited to skip layers at inference. In contrast, the hidden representations of the AR-initialized dLLM align closely with those of the AR model, showing that the initialization effect persists even though the dLLM is trained with a diffusion loss.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing (2026)
- Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching (2026)
- MetaState: Persistent Working Memory for Discrete Diffusion Language Models (2026)
- Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding? (2026)
- TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration (2026)
- Autoregressive Models Rival Diffusion Models at ANY-ORDER Generation (2026)
- Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding (2026)