🛡️ DistilBERT Expert: Role Hijack

One-vs-rest binary expert model specialized in detecting role hijack attacks. Part of the NeurAlchemy Mixture-of-Experts (MoE) security ensemble.

Test Accuracy: 97.25%

Benchmark Results

| Metric    | Score  |
|-----------|--------|
| Accuracy  | 97.25% |
| Precision | 0.8391 |
| Recall    | 0.8233 |
| F1        | 0.8311 |
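
The reported F1 can be cross-checked against the precision and recall above, since F1 is their harmonic mean. A minimal check, with illustrative variable names:

```python
# Values copied from the Benchmark Results table above.
precision = 0.8391
recall = 0.8233

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.4f}")  # F1 = 0.8311
```

Accuracy sits well above F1 here. That pattern is common in one-vs-rest setups where negatives far outnumber positives, so F1 is likely the more informative figure for this expert; the class balance itself is not stated in this card.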

All Expert Models: Performance Comparison

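| Expert             | Accuracy | F1     | HuggingFace Repo |
|--------------------|----------|--------|------------------|
| direct_injection   | 86.85%   | 0.7603 | Link             |
| indirect_injection | 90.01%   | 0.3142 | Link             |
| obfuscation        | 96.16%   | 0.7510 | Link             |
| role_hijack        | 97.25%   | 0.8311 | Link             |
| system_extraction  | 95.02%   | 0.6933 | Link             |
| tool_abuse         | 98.14%   | 0.9206 | Link             |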

Architecture

This is one of six expert models trained in a one-vs-rest configuration on the NeurAlchemy Threat Matrix. Each expert is a DistilBERT binary classifier that detects whether an input belongs to its specialized threat class.
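
To make the one-vs-rest setup concrete, the sketch below turns multi-class labels into the binary targets a single expert would train on. It is illustrative only: the card does not describe the actual training pipeline or label schema, so the helper `one_vs_rest_labels` and the `"benign"` label are assumptions.

```python
from typing import List

# Threat classes taken from the expert comparison table in this card.
THREAT_CLASSES = [
    "direct_injection",
    "indirect_injection",
    "obfuscation",
    "role_hijack",
    "system_extraction",
    "tool_abuse",
]


def one_vs_rest_labels(labels: List[str], target: str) -> List[int]:
    """Map multi-class labels to binary labels: 1 for `target`, 0 for everything else."""
    if target not in THREAT_CLASSES:
        raise ValueError(f"unknown threat class: {target}")
    return [1 if label == target else 0 for label in labels]


# Hypothetical labels for five samples ("benign" is an assumed name for non-attack inputs).
sample_labels = ["role_hijack", "benign", "tool_abuse", "role_hijack", "obfuscation"]
role_hijack_targets = one_vs_rest_labels(sample_labels, "role_hijack")
print(role_hijack_targets)  # [1, 0, 0, 1, 0]
```

Under this scheme, every sample that is not a role hijack attack, including other attack types, counts as a negative for this expert.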

Quick Start

from transformers import pipeline

# Load this expert
expert = pipeline("text-classification", model="neuralchemy/distilbert-expert-role-hijack-threat-matrix")

# Each expert outputs a confidence score for its threat class
result = expert("Some suspicious input here")
print(result)
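
By default the pipeline returns only the top label and its score. To read the threat-class confidence directly, one option is to request all label scores with `expert(text, top_k=None)` and pick out the positive label. The helper below is a sketch that works on output of that shape; `positive_score` is a hypothetical name, and `"LABEL_1"` as the positive label is an assumption that should be checked against `expert.model.config.id2label`.

```python
from typing import Dict, List


def positive_score(scores: List[Dict], positive_label: str = "LABEL_1") -> float:
    """Return the score of the positive (threat) label from per-label pipeline output."""
    for entry in scores:
        if entry["label"] == positive_label:
            return float(entry["score"])
    raise KeyError(f"label {positive_label!r} not found in {scores}")


# Output shaped like expert(text, top_k=None); the numbers are made up.
example_scores = [
    {"label": "LABEL_0", "score": 0.12},
    {"label": "LABEL_1", "score": 0.88},
]
print(positive_score(example_scores))  # 0.88
```

Depending on the installed transformers version and whether a single string or a list is passed, the result may be nested one level deeper, in which case the helper should be applied to each inner list.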

MoE Aggregation Strategies

The 6 experts are combined using one of three strategies:

  • Max-Confidence: highest expert confidence wins (75.9% accuracy)
  • Threshold (0.5): any expert above threshold triggers detection
  • Weighted F1: experts weighted by per-class F1 performance
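
The three strategies can be sketched as below. This is one plausible reading rather than the ensemble's actual code: the card does not say how the F1 weights are applied or how benign inputs are handled. The sketch assumes that each expert's confidence is multiplied by its F1, and the function names are hypothetical.

```python
from typing import Dict

# Per-expert F1 scores copied from the comparison table in this card.
EXPERT_F1 = {
    "direct_injection": 0.7603,
    "indirect_injection": 0.3142,
    "obfuscation": 0.7510,
    "role_hijack": 0.8311,
    "system_extraction": 0.6933,
    "tool_abuse": 0.9206,
}


def max_confidence(scores: Dict[str, float]) -> str:
    """Return the expert with the highest raw confidence."""
    return max(scores, key=lambda name: scores[name])


def threshold_detect(scores: Dict[str, float], threshold: float = 0.5) -> bool:
    """Flag the input if any expert's confidence exceeds the threshold."""
    return any(score > threshold for score in scores.values())


def weighted_f1(scores: Dict[str, float], f1: Dict[str, float] = EXPERT_F1) -> str:
    """Return the expert whose confidence, scaled by its F1, is highest (assumed scheme)."""
    return max(scores, key=lambda name: scores[name] * f1[name])


# Made-up confidences for a single input, one per expert.
scores = {
    "direct_injection": 0.30,
    "indirect_injection": 0.95,
    "obfuscation": 0.10,
    "role_hijack": 0.60,
    "system_extraction": 0.20,
    "tool_abuse": 0.05,
}
print(max_confidence(scores))    # indirect_injection
print(threshold_detect(scores))  # True
print(weighted_f1(scores))       # role_hijack
```

In this example the F1 weighting changes the outcome: the low-F1 indirect_injection expert is the most confident, but after scaling, role_hijack ranks first.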

Citation

@misc{neuralchemy_expert_role_hijack_2026,
  author = {NeurAlchemy},
  title = {DistilBERT Expert: Role Hijack Detection},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/neuralchemy/distilbert-expert-role-hijack-threat-matrix}
}

License: Apache 2.0 | Maintained by NeurAlchemy

Model size: 67M params | Tensor type: F32 | Format: Safetensors

Evaluation results

  • Accuracy on neuralchemy/prompt-injection-Threat-Matrix (self-reported): 0.973
  • F1 on neuralchemy/prompt-injection-Threat-Matrix (self-reported): 0.831
  • Precision on neuralchemy/prompt-injection-Threat-Matrix (self-reported): 0.839
  • Recall on neuralchemy/prompt-injection-Threat-Matrix (self-reported): 0.823