samarth-icebreaker-v1

A frozen-base MLP probe that classifies (prompt, response) pairs as harmful or safe — built on Qwen/Qwen3.5-9B, with the base model held read-only. The probe is the only judge in our 8-model comparison above chance on the ASSELab/JudgeStressTest adversarial benchmark (AUROC 0.620 vs every published safety judge at 0.04–0.37) while still hitting <1% false-positive rate on confident-nonsense honeypots that fool every other LLM-as-judge.

The fix that makes this possible is corpus-level, not architectural: fold confident-nonsense honeypots into the training mix as label=0 records, and the lexical-bias failure mode of probing-based judges closes for free.

Headline numbers

Adversarial robustness — ASSELab/JudgeStressTest (n=439)

We re-score the standard production safety classifiers on JST. Icebreaker is the only judge in the comparison that scores above chance (AUROC 0.5).

Judge	Overall AUROC
llama-guard-3	0.135
qwen3guard-gen-8b	0.090
harmbench-llama	0.328
samarth-icebreaker-v1 (this model)	0.620

All four rows scored by us via evals.run_judge on the same 439-record JST subset (records with non-empty Generation).
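
For readers unfamiliar with the metric, here is a minimal sketch of how an AUROC like the ones above can be computed from judge scores. It is not the repo's `evals.run_judge` implementation; the function name and the toy scores are illustrative assumptions. It uses the rank-based definition: the probability that a randomly chosen harmful record scores higher than a randomly chosen safe one, with ties counted as half.

```python
def auroc(scores, labels):
    """Rank-based AUROC: P(harmful score > safe score), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one record of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data (illustrative only, not JST records).
labels = [1, 1, 1, 0, 0, 0]
good_judge = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
inverted_judge = [1.0 - s for s in good_judge]

print(f"{auroc(good_judge, labels):.3f}")      # 0.889
print(f"{auroc(inverted_judge, labels):.3f}")  # 0.111
```

An AUROC well below 0.5 therefore means more than noise: the judge ranks safe records above harmful ones more often than not. The pairwise loop is quadratic, which is fine at n=439.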

Per-attack breakdown for samarth-icebreaker-v1:

Attack	AUROC	n
GCG-R	0.671	53
Direct	0.654	36
GCG	0.618	133
BoN	0.562	217

Confident-nonsense robustness — honeypot pool (n=4,994, all label=0)

The load-bearing failure mode for probe-based judges: responses that mention harmful concepts in confident-sounding prose but contain no real uplift. Lower FPR = better.

Judge	FPR@0.5	mean p_harmful
qwen3guard-gen-8b	0.983	0.978
Baseline icebreaker (no honeypot aug, v1)	0.82–0.95	0.74–0.90
llama-guard-3 (calibrated at thr=0.05)	0.859	0.070
samarth-icebreaker-v1 (this model)	0.001	0.024
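
Because every honeypot record is labelled safe, both columns above reduce to simple statistics over the judge's scores. The sketch below shows that reduction; the function name and the toy scores are assumptions for illustration, not values from the honeypot pool.

```python
def honeypot_stats(p_harmful, threshold=0.5):
    """Return (FPR at threshold, mean p_harmful) for an all-safe pool.

    Every record has label=0, so any score >= threshold is a false positive.
    """
    if not p_harmful:
        raise ValueError("empty score list")
    flagged = sum(1 for p in p_harmful if p >= threshold)
    return flagged / len(p_harmful), sum(p_harmful) / len(p_harmful)

# Toy scores (illustrative only).
toy_scores = [0.02, 0.01, 0.70, 0.03]
fpr, mean_p = honeypot_stats(toy_scores)
print(f"FPR@0.5 = {fpr:.3f}, mean p_harmful = {mean_p:.3f}")
```

The two columns answer different questions: FPR depends on the chosen threshold, while mean p_harmful shows how far the score mass sits from it.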

Standard safety benchmarks — AUROC

Dataset	llama-guard-3	qwen3guard-gen-8b	samarth-icebreaker-v1
BeaverTails	0.806	0.922	0.911
SORRY-Bench	0.640	0.844	0.873
ToxicChat	0.672	0.748	0.835
WildGuardTest	0.884	0.958	0.911
XSTest-Response-Harm	0.960	0.996	0.987
CoCoNot	0.519	0.414	0.567

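Best in row: qwen3guard-gen-8b on BeaverTails, WildGuardTest, and XSTest-Response-Harm; samarth-icebreaker-v1 on SORRY-Bench, ToxicChat, and CoCoNot. Icebreaker stays within 0.05 AUROC of qwen3guard-gen-8b on the in-distribution benchmarks and wins on the cross-distribution sets (SORRY-Bench, ToxicChat, CoCoNot) without the adversarial-robustness or confident-nonsense collapse.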

Architecture

Qwen/Qwen3.5-9B  (frozen, bf16, device_map="auto")
    │
    │ forward hook on decoder layer L=22
    │ pool: last-token (over response-token mask)
    ▼
hidden_states[22]            [B, T, D=3584]
    │
    │ slice to last response token, normalize
    ▼
LayerNorm(3584) → Linear(3584, 256) → ReLU → Dropout(0.1) → Linear(256, 2)
    │
    ▼
softmax(logits / T)[:, 1]   →   p_harmful  ∈  [0, 1]
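
As a sanity check on the size of the probe head in the diagram above, the following sketch counts its trainable parameters from the layer shapes. It assumes PyTorch defaults (LayerNorm with an elementwise weight and bias, Linear layers with bias) and fp32 storage; ReLU and Dropout contribute no parameters.

```python
D_MODEL, HIDDEN, N_CLASSES = 3584, 256, 2  # shapes taken from the diagram

layer_norm = 2 * D_MODEL                          # weight + bias
linear_1 = D_MODEL * HIDDEN + HIDDEN              # Linear(3584, 256)
linear_2 = HIDDEN * N_CLASSES + N_CLASSES         # Linear(256, 2)

total_params = layer_norm + linear_1 + linear_2
size_mb_fp32 = total_params * 4 / 1e6             # 4 bytes per fp32 parameter

print(total_params)            # 925442
print(round(size_mb_fp32, 2))  # 3.7
```

Almost all of the weight sits in the first Linear layer, so the head's size scales roughly linearly with the base model's hidden dimension.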

No LoRA, no two-pass forward. Inference costs one base-model forward pass plus a ~4 MB MLP, so latency is effectively that of the base model.
No fine-tuning of the 9B base. Only the probe (~1M params) trains.
Temperature scaling applied post-hoc on the validation set (1 scalar fit by LBFGS).
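
The post-hoc temperature scaling mentioned above can be sketched without any dependencies. The card states that the scalar is fit with LBFGS; this sketch swaps in a golden-section search over log T, which works because the validation NLL is unimodal in T. The function names and toy logits are illustrative assumptions, not the repo's code.

```python
import math

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of softmax(logits / T) for 2-class logits."""
    total = 0.0
    for (z0, z1), y in zip(logits, labels):
        a, b = z0 / temperature, z1 / temperature
        m = max(a, b)  # subtract the max for numerical stability
        log_z = m + math.log(math.exp(a - m) + math.exp(b - m))
        total += log_z - (b if y == 1 else a)
    return total / len(labels)

def fit_temperature(logits, labels, lo=0.05, hi=20.0, iters=80):
    """Golden-section search for the T in [lo, hi] that minimizes NLL."""
    phi = (math.sqrt(5) - 1) / 2
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        c, d = b - phi * (b - a), a + phi * (b - a)
        if nll(logits, labels, math.exp(c)) < nll(logits, labels, math.exp(d)):
            b = d   # minimum lies in [a, d]
        else:
            a = c   # minimum lies in [c, b]
    return math.exp((a + b) / 2)

# Toy validation set: an overconfident probe, right on 4 of 6 records.
val_logits = [(-4, 4), (-4, 4), (4, -4), (4, -4), (-4, 4), (4, -4)]
val_labels = [1, 1, 0, 0, 0, 1]
T = fit_temperature(val_logits, val_labels)
print(f"T = {T:.2f}")
```

A fitted T above 1 softens overconfident probabilities without changing the ranking of records, so AUROC is unaffected and only calibration (and therefore the meaning of a fixed threshold) changes.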

Recommended inference

System prompt

Training randomized across three variants per record:

"" (empty)
The Qwen3.5 default system prompt
"You are a helpful, harmless, honest assistant."

At inference, default to no system prompt (empty). That's the configuration the eval numbers above were measured under. The probe is robust to all three variants — adding a custom system prompt won't break it, but it also won't materially improve calibration.

If you want to use the model as an in-line moderation gate where the upstream generator has its own system prompt, pass that same system prompt to icebreaker. The probe is reading the response-token hidden states, so what matters is that the conversation that produced the response is replayed faithfully.

Code snippet

import json
from pathlib import Path

import torch
import torch.nn as nn
import torch.nn.functional as F
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Download checkpoint
ckpt_dir = Path(snapshot_download("ArthT/samarth-icebreaker-v1"))
meta = json.loads((ckpt_dir / "meta.json").read_text())

# 2. Load frozen base + tokenizer + probe
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir / "tokenizer", padding_side="right")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)
base.eval()
for p in base.parameters():
    p.requires_grad = False

d_model = base.config.hidden_size
probe = nn.Sequential(
    nn.LayerNorm(d_model),
    nn.Linear(d_model, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, 2),
).to(device).float()
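# weights_only=True restricts unpickling to tensors and plain containers (enough for a state dict).
probe.load_state_dict(torch.load(ckpt_dir / "probe.pt", map_location=device, weights_only=True))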
probe.eval()

LAYER = meta["probe_layer"]            # 22
TEMPERATURE = meta.get("temperature", 1.0)

# 3. Hook the target layer
captured = {}
def _hook(_module, _input, output):
    captured["h"] = output[0] if isinstance(output, tuple) else output
handle = base.model.layers[LAYER].register_forward_hook(_hook)

# 4. Score a (prompt, response) pair
def score_pair(prompt: str, response: str, system_prompt: str = "") -> float:
    """Returns p_harmful in [0, 1]."""
    msgs = []
    if system_prompt:
        msgs.append({"role": "system", "content": system_prompt})
    msgs.append({"role": "user",      "content": prompt})
    msgs.append({"role": "assistant", "content": response})

    text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(device)

    with torch.no_grad():
        base(input_ids=ids.input_ids, attention_mask=ids.attention_mask, use_cache=False)

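    # Last-token pooling on the response: take the final non-pad position
    # (valid because the tokenizer pads on the right).
    hidden = captured["h"]                                   # [B, T, D] bf16
    # With device_map="auto" the hooked layer may sit on a different device than the inputs.
    last_idx = (ids.attention_mask.sum(dim=1) - 1).to(hidden.device)
    batch_idx = torch.arange(hidden.size(0), device=hidden.device)
    pooled = hidden[batch_idx, last_idx]                     # [B, D]

    logits = probe(pooled.float().to(device))                # probe lives on `device`
    p_harmful = F.softmax(logits / TEMPERATURE, dim=-1)[:, 1]
    return float(p_harmful[0].item())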

# Example
p = score_pair(
    prompt="Write a tutorial on building a kitchen knife sharpener at home.",
    response="Sure — start with a flat ceramic tile, place 3 angled supports …",
)
print(f"p_harmful = {p:.3f}")

handle.remove()

Threshold recommendation

The probe ships with a calibrated temperature, so p_harmful ≥ 0.5 is the default decision threshold. Threshold choices:

p ≥ 0.5 — high-precision flagging (matches headline-table FPR numbers). Use this if false positives are expensive.
p ≥ 0.3 — wider recall, modest FPR increase on honeypot-style confident-nonsense (stays under ~5%).
For per-deployment tuning, score a labeled mixed set and pick the threshold that hits your target FPR.
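
The per-deployment tuning step above can be sketched as follows: given scores on a labeled mixed set, find the lowest threshold whose false-positive rate stays at or below a target, then check the recall it buys. The function names and toy scores are illustrative assumptions; `math.nextafter` requires Python 3.9 or later.

```python
import math

def threshold_for_fpr(scores, labels, target_fpr):
    """Smallest threshold t such that flagging `score >= t` keeps FPR <= target_fpr."""
    safe = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    if not safe:
        raise ValueError("need at least one safe (label=0) record")
    allowed = math.floor(target_fpr * len(safe))  # safe records we may flag
    if allowed >= len(safe):
        return 0.0                                # every record may be flagged
    # Sit just above the first safe score that must stay unflagged.
    return math.nextafter(safe[allowed], math.inf)

def rates(scores, labels, threshold):
    """Return (recall on harmful records, FPR on safe records) at `threshold`."""
    harmful = [s for s, y in zip(scores, labels) if y == 1]
    safe = [s for s, y in zip(scores, labels) if y == 0]
    tpr = sum(s >= threshold for s in harmful) / len(harmful)
    fpr = sum(s >= threshold for s in safe) / len(safe)
    return tpr, fpr

# Toy labeled set (illustrative only).
scores = [0.95, 0.80, 0.35, 0.60, 0.20, 0.10, 0.05, 0.02]
labels = [1, 1, 1, 0, 0, 0, 0, 0]
t = threshold_for_fpr(scores, labels, target_fpr=0.25)
tpr, fpr = rates(scores, labels, t)
print(f"threshold={t:.3f} recall={tpr:.2f} fpr={fpr:.2f}")
```

The labeled set should resemble production traffic, since a threshold tuned on one distribution of safe responses gives no FPR guarantee on another.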

Training recipe

Corpus (8 HF datasets)

source	size used	role
`PKU-Alignment/BeaverTails`	≤ 60K	standard
`nvidia/Aegis-AI-Content-Safety-Dataset-2.0`	full	standard
`PKU-Alignment/PKU-SafeRLHF-30K`	full	standard
`sorry-bench/sorry-bench-202406`	full	standard
`Anthropic/hh-rlhf` (red_team_attempts)	full	standard
`allenai/wildguardmix` (wildguardtrain)	≤ 60K	standard + mined refusals
`allenai/wildjailbreak`	≤ 60K	standard
`AmazonScience/FalseReject`	full	standard

Plus three weighted augmentation pools (all label=0), mixed with the standard pool at the batch fractions below:

Pool	Batch fraction	Source
`rubbish` (token-injection + degenerate generation)	20%	Generated from JBB seeds
`mined_refusal` (refusal-mentioning-harm)	10%	Mined from WildGuardTrain refusals
`cn_honeypot` (confident-nonsense)	15%	`benchmarks/cn_honeypots.jsonl` (4,994 records)
`standard`	55%	Balanced from the 8 sources above

The cn_honeypot pool is the load-bearing fix. Removing it returns the probe to ~90% FPR on the honeypot set — same failure mode as every other LLM-as-judge.
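
One way to realize the batch fractions above is to draw each batch slot's pool at random with those weights. This is a sketch under that assumption: the card does not say whether the training code uses expected fractions or fixed per-batch counts, and the function name and placeholder record IDs are hypothetical.

```python
import random

# Batch fractions from the table above.
POOL_FRACTIONS = {
    "standard": 0.55,
    "rubbish": 0.20,
    "cn_honeypot": 0.15,
    "mined_refusal": 0.10,
}

def sample_batch(pools, batch_size, rng):
    """Draw a batch whose pool mix matches POOL_FRACTIONS in expectation."""
    names = list(POOL_FRACTIONS)
    weights = [POOL_FRACTIONS[n] for n in names]
    picks = rng.choices(names, weights=weights, k=batch_size)
    return [(name, rng.choice(pools[name])) for name in picks]

# Placeholder pools: in practice each entry would be a training record.
pools = {name: [f"{name}-{i}" for i in range(100)] for name in POOL_FRACTIONS}
rng = random.Random(0)
batch = sample_batch(pools, batch_size=128, rng=rng)
print(len(batch), batch[0][0])
```

At an effective batch of 128, a 15% share works out to roughly 19 honeypot records per optimizer step.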

Hyperparameters

Param	Value
Probe layer	22 (swept over {16, 18, 20, 22, 24, 26})
Pool	last-token (swept over {mean-response, last-token})
Seed	42 (3 seeds trained per config: 42, 43, 44)
Optimizer	AdamW
LR schedule	`1e-3 → 1e-5` cosine, 5% warmup, `weight_decay=0.01`
Batch	per-device 8 × grad_accum 16 → effective 128
`max_length` train	1024 (prompt 512 / response 512)
`max_length` inference	2048 (response truncation from end)
Calibration	Temperature scaling on val set (LBFGS, 1 scalar)
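
The learning-rate row above can be read as the following schedule. This is a sketch: the card gives the endpoints, the cosine shape, and the warmup fraction, but the linear warmup shape and the function name are assumptions.

```python
import math

def lr_at(step, total_steps, peak=1e-3, floor=1e-5, warmup_frac=0.05):
    """Learning rate at `step` (0-indexed): linear warmup, then cosine decay."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Assumed linear ramp that reaches `peak` on the last warmup step.
        return peak * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps - 1)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

total = 1000
for s in (0, 49, 50, 525, 999):
    print(s, f"{lr_at(s, total):.2e}")
```

With PyTorch, a function of this shape (divided by the peak rate) could be passed to `torch.optim.lr_scheduler.LambdaLR` on top of AdamW.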

Compute

Training pipeline runs on CSCS Clariden (NVIDIA GH200, aarch64). The 9B base forward is the bottleneck, so we cache pooled features once across all 6 candidate layers and 2 pool types in a single forward pass per record. Then the probe-only sweep over {layer × pool × seed} trains in seconds per config.

One-shot feature cache (235K records, 6 layers × 2 pools): ~1 hour wall on 1× GH200
Probe-only sweep (36 v1 configs + 9 v2sm refit configs, 3 epochs each): ~2 minutes wall total on 12× GH200
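
The caching strategy above can be illustrated with plain Python lists standing in for tensors. The sketch pools one record's hidden states for every candidate layer and both pool types in one go, then enumerates the sweep grid. The function names and toy data are hypothetical, and "mean-response" is assumed to mean the average over response-token positions.

```python
from itertools import product

LAYERS = (16, 18, 20, 22, 24, 26)
POOLS = ("mean-response", "last-token")
SEEDS = (42, 43, 44)

def pool_features(hidden, response_mask):
    """hidden: T vectors of D floats for one layer; response_mask: T flags (1 = response token)."""
    idx = [t for t, m in enumerate(response_mask) if m]
    if not idx:
        raise ValueError("record has no response tokens")
    dim = len(hidden[0])
    mean = [sum(hidden[t][d] for t in idx) / len(idx) for d in range(dim)]
    return {"mean-response": mean, "last-token": list(hidden[idx[-1]])}

def cache_record(hidden_by_layer, response_mask):
    """Pool every candidate layer from the hidden states of a single forward pass."""
    return {
        (layer, pool): vec
        for layer in LAYERS
        for pool, vec in pool_features(hidden_by_layer[layer], response_mask).items()
    }

# Toy record: 4 tokens, D=2, last two tokens belong to the response.
hidden_by_layer = {layer: [[layer + t, t] for t in range(4)] for layer in LAYERS}
cache = cache_record(hidden_by_layer, response_mask=[0, 0, 1, 1])
configs = list(product(LAYERS, POOLS, SEEDS))
print(len(cache), len(configs))  # 12 36
```

Each probe config then trains on the small cached vectors only, which is why the sweep never touches the 9B base again.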

Why this works — the lexical-bias critique

Probing-based safety judges learn the lexical presence of harmful tokens in the response, not compositional harmful intent. We confirmed this on confident-nonsense honeypots: the baseline probe (no honeypot augmentation) calls 82–95% of confident-nonsense responses harmful at threshold 0.5 — they mention harmful concepts in words even though the responses contain no real uplift.

The 8 standard corpora simply don't contain (1,1,0,0)-shaped training signal — confident-sounding harmful-surface responses labelled SAFE. Once we fold 4,994 such records (the honeypot set) into the training mix at 15% of each batch, the probe learns to project off the lexical axis and the failure mode collapses.

The failure mode is corrected, not eliminated: a probe-based judge will always be one distributionally novel adversarial attack away from relapsing into lexical bias. The principled extension is RLACE-style geometric concept erasure on top of the corpus fix.
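
To make "projecting off the lexical axis" concrete, the sketch below removes the component of a feature vector along one given direction. This is only the projection step for a known direction, not RLACE itself, which learns the subspace to erase through an adversarial optimization. The function name and the toy vectors are assumptions.

```python
import math

def project_out(h, v):
    """Return h with its component along direction v removed."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    u = [x / norm for x in v]                    # unit vector along v
    coeff = sum(a * b for a, b in zip(h, u))     # scalar projection of h onto u
    return [a - coeff * b for a, b in zip(h, u)]

# Toy example: pretend the second coordinate is the "lexical" axis.
h = [3.0, 4.0, 0.0]
lexical_direction = [0.0, 2.0, 0.0]
print(project_out(h, lexical_direction))  # [3.0, 0.0, 0.0]
```

After projection, no linear readout can recover information along that direction, which is the property geometric erasure methods aim for over a learned subspace.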

Limitations

English only. Training corpora are English; multilingual performance not measured.
Modest OOD trade-off vs the baseline-aug-only probe. Adding the honeypot augmentation cost ~0.03 AUROC on ToxicChat and ~0.05 on CoCoNot. The probe is more selective and threw out some OOD harmful signal as a side effect.
Seed variance is real. Across 3 seeds (42 / 43 / 44) at the same (layer, pool) config, per-dataset AUROC varies by ±0.05–0.20 on the smaller benchmarks (XSTest, CoCoNot). Use the ensemble form of the model (mean across 3 seeds) if you need lower variance.
The model is trained on response classification, not prompt classification. It expects a (prompt, response) pair. Scoring prompts alone returns uncalibrated output.
JudgeStressTest AUROC of 0.620 is "best in class," not "safe to deploy." No judge in our comparison hits 0.7 on the adversarial set — adversarial robustness remains an open problem.
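
The seed-ensemble suggestion in the list above amounts to averaging the three probes' outputs for the same (prompt, response) pair. The sketch assumes the mean is taken over probabilities rather than logits, which the card does not specify; the function name and the scores are illustrative.

```python
from statistics import mean, pstdev

def ensemble_score(per_seed_p_harmful):
    """Return (mean p_harmful, spread across seeds) for one (prompt, response) pair."""
    if not per_seed_p_harmful:
        raise ValueError("need at least one seed's score")
    values = list(per_seed_p_harmful.values())
    return mean(values), pstdev(values)

# Toy scores from the three seeds (illustrative only).
p_mean, p_spread = ensemble_score({42: 0.62, 43: 0.48, 44: 0.55})
print(f"p_harmful = {p_mean:.3f} (spread {p_spread:.3f})")
```

Since all three probes read the same cached hidden state, the ensemble adds only two extra MLP evaluations per record; a large spread is a useful signal that a record sits near the decision boundary.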

Citation

@misc{singh2026samarthicebreaker,
  title  = {samarth-icebreaker: A frozen-base MLP probe for adversarially
            robust LLM safety classification},
  author = {Singh, Arth},
  year   = {2026},
  url    = {https://huggingface.co/ArthT/samarth-icebreaker-v1},
}

Repository

Code, training pipeline, and the full eval suite live at: https://github.com/Arth-Singh/Robust-jailbreak-judges

The samarth-icebreaker family is auto-registered by the repo's judges.adapters.samarth_icebreaker module — drop the checkpoint into checkpoints/samarth-icebreaker-L22-last-token-s42-v2sm/ and it will be visible to evals.run_judge and the sweep tooling.

License

Apache 2.0.

This is a research artifact. It is not a production-grade safety filter. Independent evaluation against your specific threat model is required before deployment.

