arXiv:2602.09870

Steer2Edit: From Activation Steering to Component-Level Editing

Published on Feb 10
Submitted by Chung-En Sun on Feb 16

Abstract

Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model's internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
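
The contrast the abstract draws is concrete enough to sketch. The snippet below is a minimal illustration under assumptions, not the paper's implementation: the difference-of-means construction, the outer-product update rule, and all function names are introduced here for exposition only. It shows a steering direction extracted from hidden states, the classic inference-time activation shift, and the same influence folded into a weight matrix as a rank-1 edit so that no generation-time hook is needed.

```python
import torch

def steering_vector(h_pos, h_neg):
    # Difference-of-means direction between hidden states cached on
    # behavior-positive vs. behavior-negative prompts, (n, d_model) each.
    # (Illustrative construction; the paper may derive v differently.)
    v = h_pos.mean(dim=0) - h_neg.mean(dim=0)
    return v / v.norm()

def steer_activations(h, v, alpha):
    # Classic inference-time steering: a fixed, global shift of every
    # hidden state along v during generation.
    return h + alpha * v

def rank1_edit(W, u, v, alpha):
    # Fold the influence into the weights instead: after the edit, an
    # input x gains an extra alpha * (u . x) * v in the output, with the
    # standard forward pass unchanged.
    # W: (d_out, d_in), u: (d_in,), v: (d_out,).
    return W + alpha * torch.outer(v, u)
```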

Community

Paper submitter

Can steering vectors be turned into permanent, interpretable weight edits?

STEER2EDIT converts steering signals into closed-form, component-level rank-1 updates on attention heads and MLP neurons — no fine-tuning required.

Instead of globally shifting activations at inference time, STEER2EDIT redistributes behavioral influence across individual components while preserving the standard forward pass and compatibility with optimized inference.
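
To make "redistributing influence across individual components" tangible, here is a hypothetical PyTorch sketch that scores MLP neurons by their alignment with a steering direction and edits only the top-k of them. The cosine scoring rule, the top-k selection, and every name below are illustrative assumptions, not the paper's closed-form update.

```python
import torch

def neuron_alignment(W_down, v):
    # Each MLP neuron i writes the direction W_down[:, i] into the
    # residual stream; score neurons by cosine alignment with v.
    # W_down: (d_model, d_inner), v: (d_model,).
    cols = W_down / W_down.norm(dim=0, keepdim=True)
    return cols.T @ (v / v.norm())             # (d_inner,) scores

def edit_top_neurons(W_down, v, k=32, beta=0.1):
    # Sparse, per-component edit: nudge only the k most aligned neurons
    # toward v; the rest of the layer and the forward pass are untouched.
    top = neuron_alignment(W_down, v).topk(k).indices
    W_new = W_down.clone()
    W_new[:, top] += beta * v.unsqueeze(1)      # broadcast v onto k columns
    return W_new
```

Editing a handful of columns rather than shifting every activation is what keeps the update sparse and compatible with optimized parallel inference, since the edited matrix is used exactly like the original.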

Across three behavioral control settings, we achieve strictly better attribute–utility trade-offs:

🔐 Safety alignment: +17.2% higher refusal rate at matched downstream utility

✅ Truthfulness promotion: +9.8% improvement in truthful preference

⚡ Efficient reasoning: −12.2% reasoning length on average, while maintaining accuracy

The resulting edits are sparse, architecture-preserving, and interpretable — providing a principled bridge between representation steering and weight editing.
