arxiv:2606.18246

Variable-Width Transformers

Published on Jun 16

Abstract

A novel transformer architecture with nonuniform width allocation across layers achieves better performance and efficiency compared to uniform designs by utilizing a parameter-free residual resizing mechanism.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a ×-shaped ><former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our ><former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and smaller KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of language models.
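To make the shape concrete, here is a minimal sketch of a wide-narrow-wide width schedule together with one possible parameter-free way to resize a residual vector between layers. The abstract does not say how the widths are chosen or how the resizing is done, so everything here is an assumption for illustration: the linear schedule, the truncate/zero-pad rule, and the names `x_shaped_widths` and `resize_residual` are hypothetical and should not be read as the paper's method.

```python
def x_shaped_widths(num_layers, outer_width, inner_width):
    """Per-layer widths: widest at both ends, narrowest in the middle.

    Assumption: a linear schedule. The paper may use a different one.
    """
    if num_layers < 2:
        return [outer_width] * num_layers
    mid = (num_layers - 1) / 2
    widths = []
    for i in range(num_layers):
        # frac is 1.0 at the first/last layer and 0.0 at the centre.
        frac = abs(i - mid) / mid
        widths.append(round(inner_width + frac * (outer_width - inner_width)))
    return widths


def resize_residual(x, new_width):
    """Resize a residual vector without any learned parameters.

    Assumption: truncate when narrowing, zero-pad when widening.
    """
    if new_width <= len(x):
        return x[:new_width]
    return x + [0.0] * (new_width - len(x))


widths = x_shaped_widths(num_layers=5, outer_width=8, inner_width=4)
average_width = sum(widths) / len(widths)

# Walk a toy residual vector through the schedule, resizing before each layer.
residual = [1.0] * widths[0]
for width in widths:
    residual = resize_residual(residual, width)
```

In this toy schedule the average width is below the outer width, which is the property the abstract ties to the lower FLOPs and KV cache cost. The sketch does not attempt to reproduce the reported percentages.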

Community

noahml

about 7 hours ago

Neat paper. The idea of using a bottleneck, x-shaped architecture to vary width across layers makes a lot of sense if different layers are doing different things. It’s pretty compelling that they managed to cut down on FLOPs and KV cache memory while still beating the uniform baselines on language modeling loss.

Do you think this "middle-narrowing" approach would hold up if you scaled the models even further, or does that bottleneck eventually start hurting performance?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/2dd9f233-0d09-404a-b00b-516953d638af

Get this paper in your agent:

hf papers read 2606.18246

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
