🛡️ DistilBERT Expert: Role Hijack

One-vs-rest binary expert model specialized in detecting role hijack attacks. Part of the NeurAlchemy Mixture-of-Experts (MoE) security ensemble.

Test Accuracy: 97.25%

Benchmark Results

Metric	Score
Accuracy	97.25%
Precision	0.8391
Recall	0.8233
F1	0.8311
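
As a quick consistency check, the reported F1 can be recomputed from the reported precision and recall. This is a minimal sketch that uses only the values in the table above and the standard harmonic-mean definition of F1.

```python
# Reported precision and recall for the role_hijack expert (from the table above).
precision = 0.8391
recall = 0.8233

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.8311
```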

All Expert Models — Performance Comparison

Expert	Accuracy	F1	HuggingFace Repo
direct_injection	86.85%	0.7603	Link
indirect_injection	90.01%	0.3142	Link
obfuscation	96.16%	0.7510	Link
role_hijack	97.25%	0.8311	Link
system_extraction	95.02%	0.6933	Link
tool_abuse	98.14%	0.9206	Link

Architecture

This is one of 6 expert models trained in a one-vs-rest configuration on the NeurAlchemy Threat Matrix. Each expert is a DistilBERT binary classifier that detects whether input belongs to its specialized threat class.
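
The one-vs-rest setup can be illustrated by how multi-class labels turn into one binary target list per expert. This is a sketch under stated assumptions, not the actual training code: the class names come from the comparison table, while the helper `one_vs_rest_targets`, the sample labels, and the `benign` label are hypothetical.

```python
# Threat classes taken from the expert comparison table.
THREAT_CLASSES = [
    "direct_injection", "indirect_injection", "obfuscation",
    "role_hijack", "system_extraction", "tool_abuse",
]


def one_vs_rest_targets(labels, positive_class):
    """Return 1 for samples of positive_class and 0 for everything else."""
    return [1 if label == positive_class else 0 for label in labels]


# Hypothetical multi-class labels; "benign" is an assumed extra label.
labels = ["role_hijack", "benign", "tool_abuse", "role_hijack"]

# One binary target list per expert.
targets = {cls: one_vs_rest_targets(labels, cls) for cls in THREAT_CLASSES}
print(targets["role_hijack"])  # [1, 0, 0, 1]
```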

Quick Start

from transformers import pipeline

# Load this expert
expert = pipeline("text-classification", model="neuralchemy/distilbert-expert-role-hijack-threat-matrix")

# The pipeline returns a list of dicts, each with a "label" and a "score" (the confidence)
result = expert("Some suspicious input here")
print(result)
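
To combine experts, it can help to reduce each pipeline result to a single confidence for the positive (threat) class. The helper below is a hypothetical sketch: the label name `LABEL_1` is an assumption (check the model's `id2label` config for the real names), and it assumes a two-label softmax head so that the two class scores sum to 1.

```python
def positive_confidence(result, positive_label="LABEL_1"):
    """Convert top-1 pipeline output into a confidence for the positive class.

    Assumes a two-label softmax head, so the two class scores sum to 1.
    The default label name is an assumption, not taken from the model card.
    """
    top = result[0]
    if top["label"] == positive_label:
        return top["score"]
    return 1.0 - top["score"]


# Hypothetical outputs in the shape returned by a text-classification pipeline.
print(positive_confidence([{"label": "LABEL_1", "score": 0.97}]))  # 0.97
print(positive_confidence([{"label": "LABEL_0", "score": 0.75}]))  # 0.25
```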

MoE Aggregation Strategies

The 6 experts are combined using one of three strategies:

Max-Confidence: the expert with the highest confidence determines the prediction (75.9% accuracy)
Threshold (0.5): any expert whose confidence exceeds 0.5 triggers a detection
Weighted F1: each expert's confidence is weighted by its per-class F1 score
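
The three strategies can be sketched in plain Python as follows. This is an illustrative sketch, not the ensemble's actual implementation: the function names and the sample confidences are hypothetical, and the weighted rule shown here (multiply each confidence by the expert's F1, then take the maximum) is one possible reading of "weighted by per-class F1". The weights themselves come from the comparison table.

```python
# Per-expert F1 scores from the comparison table, used as weights.
EXPERT_F1 = {
    "direct_injection": 0.7603,
    "indirect_injection": 0.3142,
    "obfuscation": 0.7510,
    "role_hijack": 0.8311,
    "system_extraction": 0.6933,
    "tool_abuse": 0.9206,
}


def max_confidence(scores):
    """Return the (expert, confidence) pair with the highest confidence."""
    name = max(scores, key=scores.get)
    return name, scores[name]


def threshold_detect(scores, threshold=0.5):
    """Return the experts whose confidence is above the threshold."""
    return [name for name, s in scores.items() if s > threshold]


def weighted_f1(scores, weights=EXPERT_F1):
    """Return the expert with the highest F1-weighted confidence (assumed rule)."""
    weighted = {name: s * weights[name] for name, s in scores.items()}
    name = max(weighted, key=weighted.get)
    return name, weighted[name]


# Hypothetical positive-class confidences, one per expert.
scores = {
    "direct_injection": 0.30,
    "indirect_injection": 0.95,
    "obfuscation": 0.10,
    "role_hijack": 0.62,
    "system_extraction": 0.05,
    "tool_abuse": 0.20,
}

print(max_confidence(scores)[0])   # indirect_injection
print(threshold_detect(scores))    # ['indirect_injection', 'role_hijack']
print(weighted_f1(scores)[0])      # role_hijack
```

In this made-up example, max-confidence and F1 weighting disagree: the indirect_injection expert is the most confident, but its low F1 (0.3142) pulls its weighted score below that of the role_hijack expert.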

Citation

@misc{neuralchemy_expert_role_hijack_2026,
  author = {NeurAlchemy},
  title = {DistilBERT Expert: Role Hijack Detection},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/neuralchemy/distilbert-expert-role-hijack-threat-matrix}
}

License: Apache 2.0 | Maintained by NeurAlchemy

Model size: 67M params (Safetensors, tensor type F32)

Evaluation results

Self-reported on neuralchemy/prompt-injection-Threat-Matrix:

Metric	Score
accuracy	0.973
f1	0.831
precision	0.839
recall	0.823