Weight Decay Improves Language Model Plasticity
Abstract
Pretraining with larger weight decay values improves model plasticity and downstream fine-tuning performance by encouraging linearly separable representations and reducing overfitting.
The prevailing paradigm in large language model (LLM) development is to pretrain a base model and then perform further training to improve performance and model behavior. However, hyperparameter optimization and scaling laws have been studied primarily from the perspective of the base model's validation loss, ignoring downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks through fine-tuning. We focus on the role of weight decay, a key regularization hyperparameter during pretraining. Through systematic experiments, we show that models trained with larger weight decay values are more plastic, meaning they show larger performance gains when fine-tuned on downstream tasks. This phenomenon can lead to counterintuitive trade-offs where base models that perform worse after pretraining can perform better after fine-tuning. Further investigation of weight decay's mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. In conclusion, this work demonstrates the importance of using evaluation metrics beyond cross-entropy loss for hyperparameter optimization and sheds light on the multifaceted role that a single optimization hyperparameter plays in shaping model behavior.
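As an illustration of the setup the abstract describes, the sketch below sweeps AdamW's decoupled weight decay during pretraining and records the fine-tuning gain as a plasticity proxy. This is a minimal sketch under assumptions: the stand-in model, learning rate, and sweep values are placeholders, not the paper's configuration.

```python
# Minimal sketch (assumptions, not the paper's code): pretrain otherwise-identical
# models that differ only in AdamW weight decay, then compare how much each one
# improves when fine-tuned on a downstream task ("plasticity").
import torch
from torch import nn

def make_optimizer(model: nn.Module, weight_decay: float) -> torch.optim.AdamW:
    # AdamW applies decoupled weight decay; larger values correspond to the
    # stronger regularization studied in the paper.
    return torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=weight_decay)

def plasticity_gain(base_metric: float, finetuned_metric: float) -> float:
    # Proxy for plasticity: downstream improvement from fine-tuning the base model.
    return finetuned_metric - base_metric

# Illustrative sweep values; the paper's actual grid is not reproduced here.
for wd in (0.0, 0.1, 0.3):
    model = nn.Linear(128, 128)  # stand-in for a language model
    optimizer = make_optimizer(model, weight_decay=wd)
    # ... run pretraining, evaluate the base model, fine-tune, evaluate again ...
```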
Community
Increasing weight decay during language model pretraining enhances model plasticity, enabling greater performance gains after fine-tuning even when the base model's validation loss is worse, highlighting the need to optimize hyperparameters with downstream adaptability in mind.
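One way to operationalize the "linearly separable representations" effect mentioned in the abstract is a linear probe on frozen hidden states: higher probe accuracy suggests more linearly separable features. The sketch below is an assumption about such an evaluation, not the authors' protocol, and uses synthetic placeholder features and labels.

```python
# Minimal linear-probe sketch (an assumption, not the authors' evaluation):
# fit a logistic-regression probe on frozen representations and report accuracy
# as a rough measure of linear separability.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 64))  # placeholder for frozen base-model features
labels = rng.integers(0, 2, size=1000)       # placeholder downstream task labels

probe = LogisticRegression(max_iter=1000).fit(hidden_states, labels)
print("linear probe accuracy:", probe.score(hidden_states, labels))
```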
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- How Much is Too Much? Exploring LoRA Rank Trade-offs for Retaining Knowledge and Domain Robustness (2025)
- Probing the Limits of Compressive Memory: A Study of Infini-Attention in Small-Scale Pretraining (2025)
- Least but not Last: Fine-tuning Intermediate Principal Components for Better Performance-Forgetting Trade-Offs (2026)
- Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models (2026)
- Diversity or Precision? A Deep Dive into Next Token Prediction (2025)
- Layer-wise LoRA fine-tuning: a similarity metric approach (2026)
- Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better (2026)