RM Debias Forget LoRAs — Llama-3.1-8B-Instruct + margin-reg fine-tune (epoch 10)
This repo bundles three forget LoRAs. Subtracting one from the base reward model is intended to reduce the base's bias toward surface style while preserving its correctness preferences.
| Adapter folder | Pair scheme | What it captures |
|---|---|---|
| style/ | Style-only | Markdown / verbose surface-style preferences (chosen = plain & short, rejected = bullet-heavy / markdown-rich), holding correctness equal. |
| correctness/ | Correctness-only | Answer-correctness preferences (chosen = correct GSM8K solution, rejected = incorrect), holding surface style equal. |
| cross/ | Cross | Crossed preferences where surface style and correctness disagree (chosen = correct-but-plain, rejected = wrong-but-stylish), the configuration on which base RMs most often err. |
Base model: xxccho/margin_reg_baseline, subfolder=checkpoint-2460
Training data: xxccho/gsm8k_rmbench_style (GSM8K math problems with 6 surface-style + correctness variants per question, yielding ~23k pairs per pair_mode).
Usage
Apply at inference time
import torch
from peft import PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# 1. Load the base reward model
BASE = "xxccho/margin_reg_baseline"
SUB = "checkpoint-2460"
kwargs = dict(num_labels=1, torch_dtype=torch.bfloat16, device_map="auto")
if SUB:
    base = AutoModelForSequenceClassification.from_pretrained(BASE, subfolder=SUB, **kwargs)
    tok = AutoTokenizer.from_pretrained(BASE, subfolder=SUB)
else:
    base = AutoModelForSequenceClassification.from_pretrained(BASE, **kwargs)
    tok = AutoTokenizer.from_pretrained(BASE)
# 2. Wrap with this LoRA (pick the pair_mode you want)
PAIR_MODE = "cross"  # one of: style, correctness, cross
model = PeftModel.from_pretrained(base, "xxccho/rm-debias-loras-llama-margin-reg", subfolder=PAIR_MODE)
# 3. Use the standard PEFT scaling-trick to *subtract* the LoRA.
# Forward of (base − λ·LoRA) is obtained by setting LoRA scaling to −λ·(alpha / r).
# For our setup (alpha=16, rank=8 → base_scaling=2.0), λ=1.0 means we
# subtract the LoRA at full strength.
LAMBDA = 0.5  # subtract at half strength
for name, module in model.named_modules():
    if hasattr(module, "scaling") and isinstance(module.scaling, dict):
        for k in module.scaling:
            module.scaling[k] = -LAMBDA * 2.0  # alpha / r = 16/8 = 2.0
# 4. Score a (prompt, response) pair
messages = [
    {"role": "user", "content": "What is 7 × 8?"},
    {"role": "assistant", "content": "7 × 8 = 56."},
]
text = tok.apply_chat_template(messages, tokenize=False)
inputs = tok(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    reward = model(**inputs).logits.squeeze(-1).item()
print(f"reward = {reward:.3f}")
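When sweeping λ it can be convenient to wrap step 3 in a small helper. The sketch below is not part of the source repo: `set_lora_lambda` is a hypothetical name, and it assumes that each PEFT LoRA layer exposes per-adapter `scaling`, `lora_alpha`, and `r` dicts and that the adapter uses the default `lora_alpha / r` scaling (no rsLoRA).

```python
def set_lora_lambda(model, lam):
    """Set every LoRA layer's scaling to -lam * (lora_alpha / r).

    Hypothetical helper. Assumes each LoRA layer exposes dicts named
    `scaling`, `lora_alpha`, and `r`, keyed by adapter name, and that the
    adapter uses the default alpha / r scaling.
    Returns the number of (layer, adapter) entries that were updated.
    """
    updated = 0
    for _, module in model.named_modules():
        scaling = getattr(module, "scaling", None)
        if not isinstance(scaling, dict):
            continue  # not a LoRA layer
        for adapter_name in scaling:
            base_scaling = module.lora_alpha[adapter_name] / module.r[adapter_name]
            scaling[adapter_name] = -lam * base_scaling
            updated += 1
    return updated
```

Under those assumptions, `set_lora_lambda(model, 0.5)` should have the same effect as the loop in step 3, and `set_lora_lambda(model, 0.0)` zeroes out the LoRA contribution. Because the helper writes an absolute value rather than multiplying the current one, repeated calls do not compound.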
Choosing λ
A small λ sweep on a held-out RM benchmark is recommended. Typical sweet spots we observed for this base model:
| Pair mode | Reasonable λ range (subtract) |
|---|---|
| style | 0.2 – 0.7 |
| correctness | 0.2 – 0.5 |
| cross | 0.3 – 1.0 |
Larger λ ⇒ stronger debias but more capacity loss; tune per downstream task.
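A minimal sketch of such a sweep, assuming held-out data in the form of (prompt, chosen, rejected) triples and pairwise accuracy as the selection metric. Both function names and the `make_score_fn` factory are hypothetical; in practice the factory would set the LoRA scaling for a given λ and return a function that scores a (prompt, response) pair as in the usage snippet.

```python
def pairwise_accuracy(score_fn, pairs):
    """Fraction of (prompt, chosen, rejected) triples where chosen scores strictly higher."""
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if score_fn(prompt, chosen) > score_fn(prompt, rejected)
    )
    return correct / len(pairs)


def sweep_lambda(make_score_fn, pairs, lambdas):
    """Evaluate each candidate lambda; return (best_lambda, {lambda: accuracy}).

    `make_score_fn(lam)` is a hypothetical factory that returns
    `score_fn(prompt, response) -> float` for the reward model with the
    LoRA subtracted at strength `lam`.
    Ties are broken in favor of the smaller lambda, i.e. the weaker edit.
    """
    results = {lam: pairwise_accuracy(make_score_fn(lam), pairs) for lam in lambdas}
    best = min(results, key=lambda lam: (-results[lam], lam))
    return best, results
```

Preferring the smaller λ on ties reflects the trade-off noted above: if two settings debias equally well on the held-out set, the one that perturbs the base model less is the safer default.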
How these LoRAs were trained
We train the LoRA to predict the WRONG preference on each pair, i.e. we intentionally fit the bias direction. Then at inference we subtract the LoRA from the base RM — task-arithmetic style. Conceptually:
base_RM: prefers chosen over rejected (correct)
LoRA: prefers rejected over chosen (bias direction)
debiased: base − λ · LoRA ⇒ same correctness, less style bias
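To make the arithmetic concrete for a single linear layer, here is a toy sketch in plain Python. The matrices are made-up illustrative values, not weights from these adapters; the only number taken from this repo is the base scaling α/r = 2.0.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scaling):
    """y = W x + scaling * B (A x): forward pass of one LoRA-wrapped linear layer."""
    base_out = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scaling * d for b, d in zip(base_out, delta)]


# Toy values for illustration only (the toy LoRA has rank 1).
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2x2)
A = [[1.0, 1.0]]              # LoRA down-projection (1x2)
B = [[0.5], [-0.5]]           # LoRA up-projection (2x1)
x = [2.0, 4.0]

base_scaling = 2.0            # alpha / r = 16 / 8 in this repo
lam = 0.5

y_base = lora_forward(W, A, B, x, 0.0)                       # LoRA switched off
y_trained = lora_forward(W, A, B, x, base_scaling)           # base + LoRA (bias direction)
y_debiased = lora_forward(W, A, B, x, -lam * base_scaling)   # base - lam * LoRA

# Per layer, the debiased output moves away from the trained-LoRA output:
# y_debiased == y_base - lam * (y_trained - y_base)
```

This identity holds exactly for each adapted linear layer. The final reward passes through nonlinear layers, so it is not guaranteed to change linearly in λ, which is one reason a sweep is still needed.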
Per-pair-mode CLI used (from the train_style_lora_gsm8k.py script in the source repo):
python scripts/train_style_lora_gsm8k.py \
    --model "xxccho/margin_reg_baseline/checkpoint-2460" \
    --data "rm_gsm8k_dataset_builder/generated/gsm8k_rmbench_train_clean.jsonl" \
    --pair-mode <style | correctness | cross> \
    --direction undesired \
    --epochs 3 \
    --learning-rate 1e-5
LoRA hyperparameters (encoded in adapter_config.json):
- target modules: q_proj, k_proj, v_proj, o_proj (attention only)
- rank r=8, alpha=16 (base scaling α/r = 2.0)
- dropout 0.05
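If you have a local copy of one adapter folder, a quick check like the following can confirm that its adapter_config.json matches the values listed above before you hard-code the 2.0 scaling. This is a sketch: `check_adapter_config` and `EXPECTED` are hypothetical names, and the keys used (`r`, `lora_alpha`, `lora_dropout`, `target_modules`) are the standard PEFT LoRA config fields.

```python
import json
from pathlib import Path

# Values documented in this model card.
EXPECTED = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "target_modules": {"q_proj", "k_proj", "v_proj", "o_proj"},
}


def check_adapter_config(adapter_dir):
    """Compare adapter_config.json in `adapter_dir` with the documented values.

    Returns (base_scaling, mismatches), where `mismatches` maps each
    differing key to a (found, expected) tuple.
    """
    config = json.loads((Path(adapter_dir) / "adapter_config.json").read_text())
    mismatches = {}
    for key, expected in EXPECTED.items():
        found = config.get(key)
        if key == "target_modules" and found is not None:
            found = set(found)  # stored as a list; order is not significant
        if found != expected:
            mismatches[key] = (found, expected)
    base_scaling = config["lora_alpha"] / config["r"]
    return base_scaling, mismatches
```

An empty `mismatches` dict together with `base_scaling == 2.0` indicates the adapter matches the hyperparameters described here.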
Reproducing the evaluation
After applying the LoRA with negative scaling (= subtract), evaluate on RM-Bench / Reward-Bench 2 to confirm the bias reduction:
python scripts/eval_reward_bench2.py \
    --base_model "xxccho/margin_reg_baseline/checkpoint-2460" \
    --lora_path <local clone of this repo>/<pair_mode> \
    --output_path rb2_<pair_mode>.json \
    --lambdas -1.0 -0.5 -0.2 0.0 0.1 0.2 0.3 0.5 0.7 1.0 2.0 3.0
(λ = +x in this script means "subtract x·LoRA"; the sign convention matches the script's --lambdas arg.)
License
LoRA adapter weights are released under Apache-2.0. The base reward model xxccho/margin_reg_baseline is governed by its own license; please review it before redistribution.
Model tree for xxccho/rm-debias-loras-llama-margin-reg
Base model
meta-llama/Llama-3.1-8B