🛡️ DistilBERT Expert: Role Hijack
One-vs-rest binary expert model specialized in detecting role hijack attacks. Part of the NeurAlchemy Mixture-of-Experts (MoE) security ensemble.
Test Accuracy: 97.25%
Benchmark Results
| Metric | Score |
|---|---|
| Accuracy | 97.25% |
| Precision | 0.8391 |
| Recall | 0.8233 |
| F1 | 0.8311 |
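As a quick consistency check, the reported F1 can be recomputed from the precision and recall in the table, since F1 is their harmonic mean. The helper name `f1_from_precision_recall` is illustrative and not part of the model's codebase.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the benchmark table.
f1 = f1_from_precision_recall(0.8391, 0.8233)
print(round(f1, 4))  # 0.8311, matching the reported F1
```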
All Expert Models: Performance Comparison
| Expert | Accuracy | F1 | HuggingFace Repo |
|---|---|---|---|
| direct_injection | 86.85% | 0.7603 | Link |
| indirect_injection | 90.01% | 0.3142 | Link |
| obfuscation | 96.16% | 0.7510 | Link |
| role_hijack | 97.25% | 0.8311 | Link |
| system_extraction | 95.02% | 0.6933 | Link |
| tool_abuse | 98.14% | 0.9206 | Link |
Architecture
This is one of six expert models trained in a one-vs-rest configuration on the NeurAlchemy Threat Matrix. Each expert is a DistilBERT binary classifier that predicts whether an input belongs to its own threat class (positive) or to any other class (negative).
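A minimal sketch of how one-vs-rest targets could be derived from multi-class labels, assuming the six class names listed in the comparison table. The function `one_vs_rest_labels` and the sample labels are hypothetical; the actual training pipeline is not described in this card.

```python
# Class names taken from the expert comparison table.
THREAT_CLASSES = [
    "direct_injection",
    "indirect_injection",
    "obfuscation",
    "role_hijack",
    "system_extraction",
    "tool_abuse",
]


def one_vs_rest_labels(labels: list[str], target: str) -> list[int]:
    """Map multi-class labels to binary targets: 1 for `target`, 0 for everything else."""
    if target not in THREAT_CLASSES:
        raise ValueError(f"unknown threat class: {target}")
    return [1 if label == target else 0 for label in labels]


# Hypothetical multi-class labels for four training examples.
sample_labels = ["role_hijack", "tool_abuse", "obfuscation", "role_hijack"]
binary_targets = one_vs_rest_labels(sample_labels, "role_hijack")
print(binary_targets)  # [1, 0, 0, 1]
```

Each of the six experts would see the same inputs but a different binary target vector, one per threat class.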
Quick Start
from transformers import pipeline
# Load this expert
expert = pipeline("text-classification", model="neuralchemy/distilbert-expert-role-hijack-threat-matrix")
# Each expert outputs a confidence score for its threat class
result = expert("Some suspicious input here")
print(result)
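The pipeline returns a list of dictionaries with `label` and `score` keys, where `score` is the probability of the top predicted label. The sketch below converts that output into a probability for the positive (role hijack) class. It assumes a two-label softmax head and that the positive class is named `LABEL_1`; both are assumptions, so check the model's `config.id2label` before relying on it.

```python
def threat_probability(result: list[dict], positive_label: str = "LABEL_1") -> float:
    """Return the positive-class probability from a binary text-classification result.

    `positive_label` is an assumed label name; verify it against the model config.
    """
    top = result[0]
    score = float(top["score"])
    # With two softmax outputs, the other class has probability 1 - score.
    return score if top["label"] == positive_label else 1.0 - score


# Hand-written examples in the pipeline's output shape (not real model output).
print(threat_probability([{"label": "LABEL_1", "score": 0.93}]))
print(threat_probability([{"label": "LABEL_0", "score": 0.80}]))
```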
MoE Aggregation Strategies
The 6 experts are combined using one of three strategies:
- Max-Confidence: highest expert confidence wins (75.9% accuracy)
- Threshold (0.5): any expert above threshold triggers detection
- Weighted F1: experts weighted by per-class F1 performance
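The three strategies above can be sketched as follows. This is an illustration under stated assumptions, not the ensemble's actual implementation: the function names are hypothetical, the input is assumed to be a mapping from expert name to positive-class probability, and the weighted variant is assumed to multiply each confidence by the expert's F1 from the comparison table before taking the maximum.

```python
# Per-expert F1 scores from the comparison table.
EXPERT_F1 = {
    "direct_injection": 0.7603,
    "indirect_injection": 0.3142,
    "obfuscation": 0.7510,
    "role_hijack": 0.8311,
    "system_extraction": 0.6933,
    "tool_abuse": 0.9206,
}


def max_confidence(scores: dict[str, float]) -> str:
    """Return the expert with the highest raw confidence."""
    return max(scores, key=scores.get)


def threshold_any(scores: dict[str, float], threshold: float = 0.5) -> bool:
    """Flag the input if any expert's confidence exceeds the threshold."""
    return any(score > threshold for score in scores.values())


def weighted_f1(scores: dict[str, float], weights: dict[str, float] = EXPERT_F1) -> str:
    """Return the expert with the highest F1-weighted confidence (assumed formula)."""
    weighted = {name: score * weights[name] for name, score in scores.items()}
    return max(weighted, key=weighted.get)


# Hypothetical per-expert confidences for a single input.
scores = {
    "direct_injection": 0.62,
    "indirect_injection": 0.70,
    "obfuscation": 0.10,
    "role_hijack": 0.55,
    "system_extraction": 0.05,
    "tool_abuse": 0.20,
}
print(max_confidence(scores))  # indirect_injection
print(threshold_any(scores))   # True
print(weighted_f1(scores))     # direct_injection
```

In this example the two winner-takes-all strategies disagree: the low-F1 `indirect_injection` expert has the highest raw confidence, but weighting by F1 favors `direct_injection`.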
Citation
@misc{neuralchemy_expert_role_hijack_2026,
author = {NeurAlchemy},
title = {DistilBERT Expert: Role Hijack Detection},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/neuralchemy/distilbert-expert-role-hijack-threat-matrix}
}
License: Apache 2.0 | Maintained by NeurAlchemy
Evaluation results
- Accuracy on neuralchemy/prompt-injection-Threat-Matrix (self-reported): 0.973
- F1 on neuralchemy/prompt-injection-Threat-Matrix (self-reported): 0.831
- Precision on neuralchemy/prompt-injection-Threat-Matrix (self-reported): 0.839
- Recall on neuralchemy/prompt-injection-Threat-Matrix (self-reported): 0.823