RM Debias Forget LoRAs — Llama-3.1-8B-Instruct + margin-reg fine-tune (epoch 10)

This repo bundles three forget LoRAs. Subtracting one of them from the base reward model reduces the base's bias toward surface style without sacrificing its correctness preferences.

| Adapter folder | Pair scheme | What it captures |
|---|---|---|
| style/ | Style-only | Markdown / verbose surface-style preferences (chosen = plain & short, rejected = bullet-heavy / markdown-rich), holding correctness equal. |
| correctness/ | Correctness-only | Answer-correctness preferences (chosen = correct GSM8K solution, rejected = incorrect), holding surface style equal. |
| cross/ | Cross | Crossed preferences where surface style and correctness disagree (chosen = correct-but-plain, rejected = wrong-but-stylish), the configuration on which base RMs most often err. |

Base model: xxccho/margin_reg_baseline (subfolder=checkpoint-2460)

Training data: xxccho/gsm8k_rmbench_style — GSM8K math problems with 6 surface-style + correctness variants per question, yielding ~23k pairs per pair_mode.


Usage

Apply at inference time

import torch
from peft import PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# 1. Load the base reward model
BASE = "xxccho/margin_reg_baseline"
SUB  = 'checkpoint-2460'
kwargs = dict(num_labels=1, torch_dtype=torch.bfloat16, device_map="auto")
if SUB:
    base = AutoModelForSequenceClassification.from_pretrained(BASE, subfolder=SUB, **kwargs)
    tok  = AutoTokenizer.from_pretrained(BASE, subfolder=SUB)
else:
    base = AutoModelForSequenceClassification.from_pretrained(BASE, **kwargs)
    tok  = AutoTokenizer.from_pretrained(BASE)

# 2. Wrap with this LoRA (pick the pair_mode you want)
PAIR_MODE = "cross"   # one of: style, correctness, cross
model = PeftModel.from_pretrained(base, "xxccho/rm-debias-loras-llama-margin-reg", subfolder=PAIR_MODE)

# 3. Use the standard PEFT scaling trick to *subtract* the LoRA.
#    The forward pass of (base − λ·LoRA) is obtained by setting the LoRA
#    scaling to −λ·(alpha / r), not −λ·alpha.
#    For our setup (alpha=16, rank=8 → base_scaling=2.0), λ=1.0 subtracts the
#    LoRA at full strength; the LAMBDA = 0.5 set below subtracts it at half strength.
LAMBDA = 0.5
for name, module in model.named_modules():
    if hasattr(module, "scaling") and isinstance(module.scaling, dict):
        for k in module.scaling:
            module.scaling[k] = -LAMBDA * 2.0   # alpha / r = 16/8 = 2.0

# 4. Score a (prompt, response) pair
messages = [
    {"role": "user", "content": "What is 7 × 8?"},
    {"role": "assistant", "content": "7 × 8 = 56."},
]
text = tok.apply_chat_template(messages, tokenize=False)
inputs = tok(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    reward = model(**inputs).logits.squeeze(-1).item()
print(f"reward = {reward:.3f}")
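The snippet above hard-codes the base scaling as 2.0. As a minimal sketch, the same value can be derived from the LoRA hyperparameters so that the override stays correct if alpha or rank differ. The helper name `subtraction_scaling` is hypothetical and not part of PEFT or of this repo.

```python
def subtraction_scaling(lam: float, lora_alpha: float, r: int) -> float:
    """Return the LoRA scaling that makes the forward pass equal base - lam * LoRA.

    Hypothetical helper (not part of PEFT): the unmodified LoRA scaling is
    lora_alpha / r, so subtracting at strength lam means using -lam * lora_alpha / r.
    """
    if r <= 0:
        raise ValueError("LoRA rank r must be positive")
    return -lam * (lora_alpha / r)


# Values stated in this card: alpha=16, r=8.
print(subtraction_scaling(0.5, lora_alpha=16, r=8))  # -1.0
print(subtraction_scaling(1.0, lora_alpha=16, r=8))  # -2.0
```

With a loaded `PeftModel`, `lora_alpha` and `r` can typically be read from `model.peft_config["default"]`, assuming the adapter was loaded under PEFT's default adapter name.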

Choosing λ

A small λ sweep on a held-out RM benchmark is recommended. Typical sweet spots we observed for this base model:

| Pair mode | Reasonable λ range (subtract) |
|---|---|
| style | 0.2 – 0.7 |
| correctness | 0.2 – 0.5 |
| cross | 0.3 – 1.0 |

Larger λ gives stronger debiasing but costs more of the base RM's capacity, so tune it per downstream task.
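A λ sweep can be sketched as below. This is only an illustration: `toy_score` is a hypothetical stand-in for the real reward model and the three pairs are made up, so the numbers say nothing about this repo's adapters. In practice `score_fn` would set the LoRA scaling for the given λ and run the model on a held-out set of (prompt, chosen, rejected) triples.

```python
from typing import Callable, Dict, Sequence, Tuple

Pair = Tuple[str, str, str]                   # (prompt, chosen, rejected)
ScoreFn = Callable[[float, str, str], float]  # (lam, prompt, response) -> reward


def pairwise_accuracy(score_fn: ScoreFn, lam: float, pairs: Sequence[Pair]) -> float:
    """Fraction of pairs where the chosen response gets the strictly higher reward."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(score_fn(lam, p, c) > score_fn(lam, p, r) for p, c, r in pairs)
    return correct / len(pairs)


def sweep_lambdas(score_fn: ScoreFn, lambdas: Sequence[float],
                  pairs: Sequence[Pair]) -> Dict[float, float]:
    """Evaluate pairwise accuracy at each candidate lambda."""
    return {lam: pairwise_accuracy(score_fn, lam, pairs) for lam in lambdas}


def toy_score(lam: float, prompt: str, response: str) -> float:
    """Hypothetical scorer: rewards the right answer plus a style bonus that shrinks with lam."""
    correctness = 1.0 if "56" in response else 0.0
    style = 1.0 if "**" in response else 0.0
    return correctness + (1.5 - 2.0 * lam) * style


pairs = [
    ("What is 7 × 8?", "7 × 8 = 56.", "**Answer:** 54"),  # cross
    ("What is 7 × 8?", "56", "**56**"),                    # style-only
    ("What is 7 × 8?", "56", "54"),                        # correctness-only
]
results = sweep_lambdas(toy_score, [0.0, 0.5, 1.0], pairs)
best_lam = max(results, key=results.get)
print(results, best_lam)
```

The toy scorer has no notion of capacity loss, so it always favors the largest λ; a real sweep should also track accuracy on pairs unrelated to style.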


How these LoRAs were trained

We train the LoRA to predict the WRONG preference on each pair, i.e. we intentionally fit the bias direction. Then at inference we subtract the LoRA from the base RM — task-arithmetic style. Conceptually:

base_RM:  prefers chosen over rejected (correct)
LoRA:     prefers rejected over chosen (bias direction)
debiased: base − λ · LoRA  ⇒ same correctness, less style bias
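At the weight level, the subtraction can be written out as follows. This is a sketch assuming the standard LoRA forward pass for a single adapted projection, with frozen weight W_0, LoRA factors B and A, and dropout inactive at inference; the symbols are introduced here for illustration and do not appear in the source scripts.

```latex
\begin{aligned}
h &= W_0 x + s\,B A x, \qquad s = \frac{\alpha}{r} \\
h_\lambda &= W_0 x - \lambda \frac{\alpha}{r}\,B A x
           = \left(W_0 - \lambda\,\Delta W\right) x, \qquad \Delta W = \frac{\alpha}{r}\,B A
\end{aligned}
```

Overriding the scaling s with −λ·α/r is therefore the same as subtracting λ times the learned update ΔW from each adapted weight. With α = 16, r = 8 and λ = 0.5, the overridden scaling is −1.0.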

Per-pair-mode CLI used (from the train_style_lora_gsm8k.py script in the source repo):

python scripts/train_style_lora_gsm8k.py \
    --model "xxccho/margin_reg_baseline/checkpoint-2460" \
    --data  "rm_gsm8k_dataset_builder/generated/gsm8k_rmbench_train_clean.jsonl" \
    --pair-mode  <style | correctness | cross> \
    --direction  undesired \
    --epochs     3 \
    --learning-rate 1e-5

LoRA hyperparameters (encoded in adapter_config.json):

  • target modules: q_proj, k_proj, v_proj, o_proj (attention only)
  • rank r=8, alpha=16 (base scaling α/r = 2.0)
  • dropout 0.05

Reproducing the evaluation

After applying the LoRA with negative scaling (= subtract), evaluate on RM-Bench / Reward-Bench 2 to confirm the bias reduction:

python scripts/eval_reward_bench2.py \
    --base_model "xxccho/margin_reg_baseline/checkpoint-2460" \
    --lora_path  <local clone of this repo>/<pair_mode> \
    --output_path rb2_<pair_mode>.json \
    --lambdas    -1.0 -0.5 -0.2 0.0 0.1 0.2 0.3 0.5 0.7 1.0 2.0 3.0

(In this script, a positive λ = x means "subtract x·LoRA", matching the sign of the --lambdas values; the negative entries therefore add the LoRA instead of subtracting it.)


License

LoRA adapter weights are released under Apache-2.0. The base reward model xxccho/margin_reg_baseline is governed by its own license — please review that before redistribution.
