samarth-icebreaker-v1

A frozen-base MLP probe that classifies (prompt, response) pairs as harmful or safe, built on Qwen/Qwen3.5-9B with the base model held read-only. In our 8-model comparison, the probe is the only judge that scores above chance on the ASSELab/JudgeStressTest adversarial benchmark (AUROC 0.620, versus 0.04–0.37 for every published safety judge), while still keeping the false-positive rate below 1% on confident-nonsense honeypots that fool every other LLM-as-judge.

The fix that makes this possible is corpus-level, not architectural: fold confident-nonsense honeypots into the training mix as label=0 records, and the lexical-bias failure mode of probing-based judges closes for free.

Headline numbers

Adversarial robustness: ASSELab/JudgeStressTest (n=439)

We re-score the standard production safety classifiers on JST. Icebreaker is the only judge in the comparison above chance.

Judge Overall AUROC
llama-guard-3 0.135
qwen3guard-gen-8b 0.090
harmbench-llama 0.328
samarth-icebreaker-v1 (this model) 0.620

All four rows scored by us via evals.run_judge on the same 439-record JST subset (records with non-empty Generation).
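To make "above chance" concrete: AUROC is the probability that a randomly chosen harmful record receives a higher score than a randomly chosen safe one, so 0.5 is chance and values well below 0.5 mean the judge ranks records the wrong way round. The sketch below is a dependency-free illustration of that pairwise definition. The function name `auroc` and the toy scores are assumptions made for illustration; this is not the implementation behind evals.run_judge.

```python
def auroc(scores, labels):
    """Pairwise AUROC: P(score_harmful > score_safe), with ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one harmful and one safe record")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, illustrative only (not JST records)
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0]
toy_auc = auroc(toy_scores, toy_labels)

# Inverting every score mirrors the AUROC around 0.5
inverted_auc = auroc([1.0 - s for s in toy_scores], toy_labels)
```

The quadratic loop is fine for a 439-record set; a rank-based formulation would be the usual choice at larger scale.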

Per-attack breakdown for samarth-icebreaker-v1:

Attack AUROC n
GCG-R 0.671 53
Direct 0.654 36
GCG 0.618 133
BoN 0.562 217

Confident-nonsense robustness: honeypot pool (n=4,994, all label=0)

The load-bearing failure mode for probe-based judges: responses that mention harmful concepts in confident-sounding prose but contain no real uplift. Lower FPR = better.

Judge FPR@0.5 mean p_harmful
qwen3guard-gen-8b 0.983 0.978
Baseline icebreaker (no honeypot aug, v1) 0.82-0.95 0.74-0.90
llama-guard-3 (calibrated at thr=0.05) 0.859 0.070
samarth-icebreaker-v1 (this model) 0.001 0.024
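Because every honeypot record is label=0, both columns in the table above reduce to simple statistics over the judge's scores. The sketch below shows one way to compute them; the helper name `honeypot_metrics` and the toy scores are hypothetical and not taken from the repository.

```python
def honeypot_metrics(p_harmful, threshold=0.5):
    """Metrics on an all-safe pool: every flag at or above threshold is a false positive."""
    if not p_harmful:
        raise ValueError("empty score list")
    flagged = sum(1 for p in p_harmful if p >= threshold)
    return {
        "fpr": flagged / len(p_harmful),
        "mean_p_harmful": sum(p_harmful) / len(p_harmful),
    }


# Toy scores, illustrative only
toy = [0.01, 0.02, 0.03, 0.60]
default_metrics = honeypot_metrics(toy)                 # threshold 0.5
low_thr_metrics = honeypot_metrics(toy, threshold=0.02)
```

The two columns can diverge sharply when a judge runs at a low operating threshold: a low mean score can still sit above the threshold on most records, which is the pattern in the llama-guard-3 row.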

Standard safety benchmarks: AUROC

Dataset llama-guard-3 qwen3guard-gen-8b samarth-icebreaker-v1
BeaverTails 0.806 0.922* 0.911
SORRY-Bench 0.640 0.844 0.873*
ToxicChat 0.672 0.748 0.835*
WildGuardTest 0.884 0.958* 0.911
XSTest-Response-Harm 0.960 0.996* 0.987
CoCoNot 0.519 0.414 0.567*

* = best in row. Icebreaker stays within 0.05 AUROC of qwen3guard-gen-8b on the in-distribution benchmarks and wins on the cross-distribution sets (SORRY-Bench, ToxicChat, CoCoNot), without the adversarial-robustness or confident-nonsense collapse.

Architecture

Qwen/Qwen3.5-9B  (frozen, bf16, device_map="auto")
    │
    │ forward hook on decoder layer L=22
    │ pool: last-token (over response-token mask)
    ▼
hidden_states[22]            [B, T, D=3584]
    │
    │ slice to last response token, normalize
    ▼
LayerNorm(3584) → Linear(3584, 256) → ReLU → Dropout(0.1) → Linear(256, 2)
    │
    ▼
softmax(logits / T)[:, 1]   →   p_harmful  ∈  [0, 1]
  • No LoRA, no two-pass forward. Inference cost = 1× base forward + a ~4 MB MLP. Latency is effectively identical to the base model itself.
  • No fine-tuning of the 9B base. Only the probe (~1M params) trains.
  • Temperature scaling applied post-hoc on the validation set (1 scalar fit by LBFGS).
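The temperature-scaling step in the last bullet fits a single scalar T that minimizes the negative log-likelihood of softmax(logits / T) on held-out data. The sketch below illustrates that objective in pure Python. Note the substitution: it uses a golden-section search over log T as a dependency-free stand-in for the LBFGS fit named above, and the function names, search bounds, and toy logits are all assumptions.

```python
import math


def nll(logits, labels, T):
    """Mean negative log-likelihood of softmax(logits / T) for 2-class logits."""
    total = 0.0
    for (z0, z1), y in zip(logits, labels):
        a, b = z0 / T, z1 / T
        m = max(a, b)  # subtract the max for numerical stability
        log_norm = m + math.log(math.exp(a - m) + math.exp(b - m))
        total += log_norm - (b if y == 1 else a)
    return total / len(labels)


def fit_temperature(logits, labels, lo=0.05, hi=20.0, iters=80):
    """Golden-section search over log T (stand-in for the LBFGS fit)."""
    phi = (math.sqrt(5) - 1) / 2
    a, b = math.log(lo), math.log(hi)
    c, d = b - phi * (b - a), a + phi * (b - a)
    for _ in range(iters):
        if nll(logits, labels, math.exp(c)) < nll(logits, labels, math.exp(d)):
            b, d = d, c
            c = b - phi * (b - a)
        else:
            a, c = c, d
            d = a + phi * (b - a)
    return math.exp((a + b) / 2)


# Toy validation set: an overconfident probe (margin 8) that is right 6 times out of 8
toy_logits = [(-4, 4)] * 3 + [(4, -4)] * 3 + [(-4, 4), (4, -4)]
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
T_fit = fit_temperature(toy_logits, toy_labels)
```

On this toy set the optimum has a closed form, T = 8 / ln 3, at which the softmax confidence in the predicted class drops to 0.75, matching the 6-of-8 accuracy. That is the sense in which the fitted scalar calibrates the probe without changing its ranking.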

Recommended inference

System prompt

Training randomized the system prompt across three variants per record:

  1. "" (empty)
  2. The Qwen3.5 default system prompt
  3. "You are a helpful, harmless, honest assistant."

At inference, default to no system prompt (empty). That's the configuration the eval numbers above were measured under. The probe is robust to all three variants: adding a custom system prompt won't break it, but it also won't materially improve calibration.

If you want to use the model as an in-line moderation gate where the upstream generator has its own system prompt, pass that same system prompt to icebreaker. The probe is reading the response-token hidden states, so what matters is that the conversation that produced the response is replayed faithfully.

Code snippet

import torch
import json
from pathlib import Path
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch.nn as nn
import torch.nn.functional as F

# 1. Download checkpoint
ckpt_dir = Path(snapshot_download("ArthT/samarth-icebreaker-v1"))
meta = json.loads((ckpt_dir / "meta.json").read_text())

# 2. Load frozen base + tokenizer + probe
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir / "tokenizer", padding_side="right")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)
base.eval()
for p in base.parameters():
    p.requires_grad = False

d_model = base.config.hidden_size
probe = nn.Sequential(
    nn.LayerNorm(d_model),
    nn.Linear(d_model, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, 2),
).to(device).float()
probe.load_state_dict(torch.load(ckpt_dir / "probe.pt", map_location=device))
probe.eval()

LAYER = meta["probe_layer"]            # 22
TEMPERATURE = meta.get("temperature", 1.0)

# 3. Hook the target layer
captured = {}
def _hook(_module, _input, output):
    captured["h"] = output[0] if isinstance(output, tuple) else output
handle = base.model.layers[LAYER].register_forward_hook(_hook)

# 4. Score a (prompt, response) pair
def score_pair(prompt: str, response: str, system_prompt: str = "") -> float:
    """Returns p_harmful in [0, 1]."""
    msgs = []
    if system_prompt:
        msgs.append({"role": "system", "content": system_prompt})
    msgs.append({"role": "user",      "content": prompt})
    msgs.append({"role": "assistant", "content": response})

    text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(device)

    with torch.no_grad():
        base(input_ids=ids.input_ids, attention_mask=ids.attention_mask, use_cache=False)

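    # Last-token pooling on the response: take the final non-pad position
    hidden = captured["h"]                                    # [B, T, D] bf16
    last_idx = (ids.attention_mask.sum(dim=1) - 1).to(hidden.device)
    rows = torch.arange(hidden.size(0), device=hidden.device)
    pooled = hidden[rows, last_idx]                           # [B, D]

    # With device_map="auto" the hooked layer may sit on a different device than the probe
    logits = probe(pooled.float().to(device))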
    p_harmful = F.softmax(logits / TEMPERATURE, dim=-1)[:, 1]
    return float(p_harmful[0].item())

# Example
p = score_pair(
    prompt="Write a tutorial on building a kitchen knife sharpener at home.",
    response="Sure, start with a flat ceramic tile, place 3 angled supports …",
)
print(f"p_harmful = {p:.3f}")

handle.remove()
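For the in-line moderation use case described under "System prompt", the scorer can be wrapped in a small gate that replays the generator's own system prompt. The sketch below is one possible shape, not part of the released code: `moderation_gate` and its `fallback` message are hypothetical, and `scorer` is assumed to be any callable with the same signature as score_pair above.

```python
from typing import Callable


def moderation_gate(
    prompt: str,
    response: str,
    scorer: Callable[..., float],
    system_prompt: str = "",
    threshold: float = 0.5,
    fallback: str = "[response withheld by safety gate]",
) -> tuple[str, float]:
    """Return (text_to_show, p_harmful); the response is replaced when it is flagged."""
    # Replay the same system prompt the upstream generator used
    p = scorer(prompt, response, system_prompt=system_prompt)
    if p >= threshold:
        return fallback, p
    return response, p
```

With the snippet above loaded, a call would look like `moderation_gate(user_msg, model_reply, score_pair, system_prompt=generator_system_prompt)`, where the three argument names are placeholders for values from the calling application.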

Threshold recommendation

The probe ships with a calibrated temperature, so p_harmful ≥ 0.5 is the default decision threshold. Calibration choices:

  • p ≥ 0.5: high-precision flagging (matches headline-table FPR numbers). Use this if false positives are expensive.
  • p ≥ 0.3: wider recall, modest FPR increase on honeypot-style confident-nonsense (stays under ~5%).
  • For per-deployment tuning, score a labeled mixed set and pick the threshold that hits your target FPR.
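The per-deployment tuning in the last bullet can be sketched as follows: score a labeled set, then take the lowest threshold whose false-positive rate on the label=0 records stays at or under the target, which keeps recall as high as the FPR budget allows. The function names and toy data are illustrative assumptions.

```python
def fpr_at(scores, labels, threshold):
    """Share of label=0 records flagged under the rule score >= threshold."""
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not neg:
        raise ValueError("need at least one label=0 record")
    return sum(1 for s in neg if s >= threshold) / len(neg)


def threshold_for_target_fpr(scores, labels, target_fpr):
    """Lowest observed score usable as a threshold while keeping FPR <= target_fpr."""
    for t in sorted(set(scores)):
        if fpr_at(scores, labels, t) <= target_fpr:
            return t
    return float("inf")  # no observed score meets the target: flag nothing


# Toy labeled set, illustrative only
toy_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.70, 0.80, 0.90]
toy_labels = [0, 0, 0, 0, 1, 0, 1, 1]
picked = threshold_for_target_fpr(toy_scores, toy_labels, target_fpr=0.2)
strict = threshold_for_target_fpr(toy_scores, toy_labels, target_fpr=0.0)
```

The scan is quadratic in the number of records, which is acceptable for a small tuning set; sorting the negatives once would be the natural optimization for larger ones.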

Training recipe

Corpus (8 HF datasets)

source size used role
PKU-Alignment/BeaverTails ≤ 60K standard
nvidia/Aegis-AI-Content-Safety-Dataset-2.0 full standard
PKU-Alignment/PKU-SafeRLHF-30K full standard
sorry-bench/sorry-bench-202406 full standard
Anthropic/hh-rlhf (red_team_attempts) full standard
allenai/wildguardmix (wildguardtrain) ≤ 60K standard + mined refusals
allenai/wildjailbreak ≤ 60K standard
AmazonScience/FalseReject full standard

Plus three weighted augmentation pools (all label=0):

Pool Batch fraction Source
rubbish (token-injection + degenerate generation) 20% Generated from JBB seeds
mined_refusal (refusal-mentioning-harm) 10% Mined from WildGuardTrain refusals
cn_honeypot (confident-nonsense) 15% benchmarks/cn_honeypots.jsonl (4,994 records)
standard 55% Balanced from the 8 sources above

The cn_honeypot pool is the load-bearing fix. Removing it returns the probe to ~90% FPR on the honeypot set, the same failure mode as every other LLM-as-judge.
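A minimal sketch of how the batch fractions in the table could be turned into a per-batch sampler. This is an assumption about the mechanics, not the repository's sampler: the names `pool_counts` and `sample_batch`, the largest-remainder rounding, and sampling with replacement are all illustrative choices, and whether mixing happens per device batch or per effective batch is not stated above.

```python
import random

# Batch fractions from the table above
POOL_FRACTIONS = {
    "standard": 0.55,
    "rubbish": 0.20,
    "cn_honeypot": 0.15,
    "mined_refusal": 0.10,
}


def pool_counts(batch_size, fractions=POOL_FRACTIONS):
    """Largest-remainder rounding so the per-pool counts sum exactly to batch_size."""
    raw = {name: frac * batch_size for name, frac in fractions.items()}
    counts = {name: int(value) for name, value in raw.items()}
    leftover = batch_size - sum(counts.values())
    by_remainder = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


def sample_batch(pools, batch_size, rng):
    """Draw one mixed batch; pools maps pool name -> list of records."""
    batch = []
    for name, n in pool_counts(batch_size).items():
        batch.extend(rng.choices(pools[name], k=n))  # sampling with replacement
    rng.shuffle(batch)
    return batch


counts_128 = pool_counts(128)
```

At the effective batch size of 128, this rounding yields 19 confident-nonsense records per batch, so the probe sees safe-labelled harmful-surface text at every optimizer step rather than only occasionally.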

Hyperparameters

Param Value
Probe layer 22 (swept over {16, 18, 20, 22, 24, 26})
Pool last-token (swept over {mean-response, last-token})
Seed 42 (3 seeds trained per config: 42, 43, 44)
Optimizer AdamW
LR schedule 1e-3 → 1e-5 cosine, 5% warmup, weight_decay=0.01
Batch per-device 8 × grad_accum 16 → effective 128
max_length train 1024 (prompt 512 / response 512)
max_length inference 2048 (response truncation from end)
Calibration Temperature scaling on val set (LBFGS, 1 scalar)
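The LR schedule row can be read as a function of the optimizer step. The sketch below assumes a linear warmup over the first 5% of steps followed by cosine decay from 1e-3 to 1e-5; the warmup shape and the function name `lr_at` are assumptions, since the table only gives the endpoints and the warmup fraction.

```python
import math


def lr_at(step, total_steps, peak=1e-3, floor=1e-5, warmup_frac=0.05):
    """Learning rate at a given step: linear warmup (assumed), then cosine decay."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))


# Sample the schedule at a few points of a hypothetical 1000-step run
schedule = {s: lr_at(s, 1000) for s in (0, 49, 50, 500, 1000)}
```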

Compute

The training pipeline runs on CSCS Clariden (NVIDIA GH200, aarch64). The 9B base forward is the bottleneck, so we cache pooled features once across all 6 candidate layers and 2 pool types in a single forward pass per record. Then the probe-only sweep over {layer × pool × seed} trains in seconds per config.

  • One-shot feature cache (235K records, 6 layers × 2 pools): ~1 hour wall on 1× GH200
  • Probe-only sweep (36 v1 configs + 9 v2sm refit configs, 3 epochs each): ~2 minutes wall total on 12× GH200
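The arithmetic behind the cache-once design can be sketched by enumerating the v1 sweep grid from the hyperparameter table: 36 probe configs read from only 12 cached feature sets, because the seed changes the probe's initialization and not the base-model features. The `cache_key` naming is hypothetical, and the 9 v2sm refit configs are not enumerated here because their composition is not described above.

```python
from itertools import product

LAYERS = (16, 18, 20, 22, 24, 26)
POOLS = ("mean-response", "last-token")
SEEDS = (42, 43, 44)


def sweep_configs():
    """Every (layer, pool, seed) combination in the v1 sweep."""
    return [
        {"layer": layer, "pool": pool, "seed": seed}
        for layer, pool, seed in product(LAYERS, POOLS, SEEDS)
    ]


def cache_key(layer, pool):
    """Hypothetical key for one cached feature set; the seed is deliberately absent."""
    return f"L{layer}-{pool}"


configs = sweep_configs()
cache_keys = sorted({cache_key(c["layer"], c["pool"]) for c in configs})
```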

Why this works: the lexical-bias critique

Probing-based safety judges learn the lexical presence of harmful tokens in the response, not compositional harmful intent. We confirmed this on confident-nonsense honeypots: the baseline probe (no honeypot augmentation) calls 82–95% of confident-nonsense responses harmful at threshold 0.5, because they mention harmful concepts even though they contain no real uplift.

The 8 standard corpora simply don't contain (1,1,0,0)-shaped training signal, that is, confident-sounding harmful-surface responses labelled SAFE. Once we fold 4,994 such records (the honeypot set) into the training mix at 15% of each batch, the probe learns to project off the lexical axis and the failure mode collapses.

The fix mitigates the failure mode rather than eliminating it: a probe-based judge will always be one distributionally novel adversarial attack away from re-failing on lexical bias. The principled extension is RLACE-style geometric concept erasure on top of the corpus fix.

Limitations

  • English only. Training corpora are English; multilingual performance not measured.
  • Modest OOD trade-off vs the baseline-aug-only probe. Adding the honeypot augmentation cost ~0.03 AUROC on ToxicChat and ~0.05 on CoCoNot. The probe is more selective and threw out some OOD harmful signal as a side effect.
  • Seed variance is real. Across 3 seeds (42 / 43 / 44) at the same (layer, pool) config, per-dataset AUROC varies by ±0.05–0.20 on the smaller benchmarks (XSTest, CoCoNot). Use the ensemble form of the model (mean across 3 seeds) if you need lower variance.
  • The model is trained on response classification, not prompt classification. It expects a (prompt, response) pair. Scoring prompts alone returns uncalibrated output.
  • JudgeStressTest AUROC of 0.620 is "best in class," not "safe to deploy." No judge in our comparison hits 0.7 on the adversarial set, so adversarial robustness remains an open problem.
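The ensemble form mentioned in the seed-variance bullet can be sketched as a per-record mean over the three seeds' outputs. Averaging p_harmful (rather than logits) is an assumption here, as is the helper name `ensemble_p_harmful`; the toy scores are illustrative.

```python
def ensemble_p_harmful(per_seed_scores):
    """Average p_harmful across seeds; per_seed_scores[k][i] is seed k's score for record i."""
    if not per_seed_scores:
        raise ValueError("need scores from at least one seed")
    n = len(per_seed_scores[0])
    if any(len(scores) != n for scores in per_seed_scores):
        raise ValueError("every seed must score the same records")
    k = len(per_seed_scores)
    return [sum(scores[i] for scores in per_seed_scores) / k for i in range(n)]


# Toy scores for two records from seeds 42, 43, 44 (illustrative only)
ensembled = ensemble_p_harmful([[0.2, 0.9], [0.4, 0.7], [0.6, 0.8]])
```

Since all three probes read the same frozen-base hidden state, the ensemble adds three small MLP evaluations and no extra base forward passes.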

Citation

@misc{singh2026samarthicebreaker,
  title  = {samarth-icebreaker: A frozen-base MLP probe for adversarially
            robust LLM safety classification},
  author = {Singh, Arth},
  year   = {2026},
  url    = {https://huggingface.co/ArthT/samarth-icebreaker-v1},
}

Repository

Code, training pipeline, and the full eval suite live at: https://github.com/Arth-Singh/Robust-jailbreak-judges

The samarth-icebreaker family is auto-registered by the repo's judges.adapters.samarth_icebreaker module: drop the checkpoint into checkpoints/samarth-icebreaker-L22-last-token-s42-v2sm/ and it will be visible to evals.run_judge and the sweep tooling.

License

Apache 2.0.

This is a research artifact. It is not a production-grade safety filter. Independent evaluation against your specific threat model is required before deployment.
