arXiv:2602.09870

Steer2Edit: From Activation Steering to Component-Level Editing

Published on Feb 10
Submitted by Chung-En Sun on Feb 16

Abstract

Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model's internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
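
The contrast the abstract draws is concrete enough to sketch. The snippet below is a minimal illustration under assumptions, not the paper's implementation: the difference-of-means construction, the outer-product update rule, and all function names are introduced here for exposition only. It shows a steering direction extracted from hidden states, the classic inference-time activation shift, and the same influence folded into a weight matrix as a rank-1 edit so that no generation-time hook is needed.

```python
import torch

def steering_vector(h_pos, h_neg):
    # Difference-of-means direction between hidden states cached on
    # behavior-positive vs. behavior-negative prompts, (n, d_model) each.
    # (Illustrative construction; the paper may derive v differently.)
    v = h_pos.mean(dim=0) - h_neg.mean(dim=0)
    return v / v.norm()

def steer_activations(h, v, alpha):
    # Classic inference-time steering: a fixed, global shift of every
    # hidden state along v during generation.
    return h + alpha * v

def rank1_edit(W, u, v, alpha):
    # Fold the influence into the weights instead: after the edit, an
    # input x gains an extra alpha * (u . x) * v in the output, with the
    # standard forward pass unchanged.
    # W: (d_out, d_in), u: (d_in,), v: (d_out,).
    return W + alpha * torch.outer(v, u)
```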

Community

Paper submitter

Can steering vectors be turned into permanent, interpretable weight edits?

STEER2EDIT converts steering signals into closed-form, component-level rank-1 updates on attention heads and MLP neurons — no fine-tuning required.

Instead of globally shifting activations at inference time, STEER2EDIT redistributes behavioral influence across individual components while preserving the standard forward pass and compatibility with optimized inference.
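
To make "redistributing influence across individual components" tangible, here is a hypothetical PyTorch sketch that scores MLP neurons by their alignment with a steering direction and edits only the top-k of them. The cosine scoring rule, the top-k selection, and every name below are illustrative assumptions, not the paper's closed-form update.

```python
import torch

def neuron_alignment(W_down, v):
    # Each MLP neuron i writes the direction W_down[:, i] into the
    # residual stream; score neurons by cosine alignment with v.
    # W_down: (d_model, d_inner), v: (d_model,).
    cols = W_down / W_down.norm(dim=0, keepdim=True)
    return cols.T @ (v / v.norm())             # (d_inner,) scores

def edit_top_neurons(W_down, v, k=32, beta=0.1):
    # Sparse, per-component edit: nudge only the k most aligned neurons
    # toward v; the rest of the layer and the forward pass are untouched.
    top = neuron_alignment(W_down, v).topk(k).indices
    W_new = W_down.clone()
    W_new[:, top] += beta * v.unsqueeze(1)      # broadcast v onto k columns
    return W_new
```

Editing a handful of columns rather than shifting every activation is what keeps the update sparse and compatible with optimized parallel inference, since the edited matrix is used exactly like the original.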

Across three behavioral control settings, we achieve strictly better attribute–utility trade-offs:

🔐 Safety alignment: +17.2% higher refusal rate at matched downstream utility

✅ Truthfulness promotion: +9.8% improvement in truthful preference

⚡ Efficient reasoning: −12.2% reasoning length on average, while maintaining accuracy

The resulting edits are sparse, architecture-preserving, and interpretable — providing a principled bridge between representation steering and weight editing.
