# AISecOps – Trained Security Models

Fine-tuned models powering the AISecOps AI Security Operations Platform.
These models form the multi-layer threat-detection pipeline that protects LLM systems from prompt injection, jailbreaks, and adversarial attacks.

## Model Overview
| File | Type | Purpose | Size |
|---|---|---|---|
| `trajectory_model_best.pt` | PyTorch Transformer | Session-level escalation detector (best checkpoint) | 150 MB |
| `trajectory_model_final.pt` | PyTorch Transformer | Session-level escalation detector (final epoch) | 50 MB |
| `isolation_forest.pkl` | scikit-learn | One-class anomaly detector for prompt embeddings | 5.5 MB |
| `fusion_model.pt` | PyTorch MLP | Score fusion combiner (final-stage classifier) | 21 KB |
| `fusion_threshold.json` | Config | Optimal decision threshold (Youden J calibration) | – |
| `trajectory_model_best_config.json` | Config | Trajectory model architecture spec | – |
| `training_feature_stats.json` | Config | Feature normalisation statistics | – |
## Pipeline Position

These models run inside the AISecOps 6-layer security pipeline:

```
User Prompt
    ↓
FastPreFilter (regex, <5 ms)
    ↓
Threat Detection    ← isolation_forest.pkl runs here
    ↓               ← trajectory_model_best.pt runs here
Fusion Engine       ← fusion_model.pt runs here
    ↓
Policy Decision
    ↓
LLM / Target Endpoint
    ↓
Output Security
    ↓
Safe Response
```
## Model Details

### 1. Trajectory Model (`trajectory_model_best.pt`)

A Transformer encoder that tracks session-level escalation patterns, detecting when a conversation is gradually steering toward adversarial behaviour across multiple turns.
Architecture:
| Parameter | Value |
|---|---|
| Input dimension | 1024 (E5-large-v2 embeddings) |
| Hidden dimension | 512 |
| Transformer layers | 4 |
| Attention heads | 8 |
| Dropout | 0.3 |
| Max sequence length | 6 turns |
Training inputs: Sequences of E5-large-v2 embeddings (1024-d) from conversation sessions.
Output: Scalar escalation score in [0, 1].
Training data: Adversarial prompt datasets including JailbreakBench, prompt injection corpora, and synthetic escalation sequences. Safe prompts drawn from ShareGPT and standard assistant conversation datasets.
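The architecture table above is enough to sketch a compatible encoder. This is a hypothetical reconstruction, not the shipped model: the layer names, projection, and last-turn pooling are assumptions; `trajectory_model_best_config.json` and the checkpoint define the real architecture.

```python
import torch
import torch.nn as nn

class TrajectoryModel(nn.Module):
    """Sketch of a session-level escalation detector matching the spec:
    1024-d inputs, 512-d hidden, 4 layers, 8 heads, dropout 0.3, <=6 turns."""

    def __init__(self, input_dim=1024, hidden_dim=512,
                 n_layers=4, n_heads=8, dropout=0.3):
        super().__init__()
        self.proj = nn.Linear(input_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=n_heads,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Sigmoid head squashes the score into [0, 1]
        self.head = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, x):            # x: (batch, turns, 1024)
        h = self.encoder(self.proj(x))
        return self.head(h[:, -1])   # pool: score read from the last turn

model = TrajectoryModel().eval()
# Three turns of stand-in embeddings (real inputs come from E5-large-v2)
with torch.no_grad():
    score = model(torch.randn(1, 3, 1024)).item()
print(f"escalation score: {score:.3f}")
```

A sequence shorter than the 6-turn maximum simply produces fewer positions; the score is read from the final turn's representation.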
### 2. Isolation Forest (`isolation_forest.pkl`)

A one-class anomaly detector trained exclusively on benign prompt embeddings.

- Algorithm: scikit-learn `IsolationForest`
- Training data: Safe prompt embeddings (E5-large-v2, 1024-d)
- Score normalisation: Percentile-based min-max to [0, 1]
- Decision threshold: 0.5 (default)
- Logic: Any prompt that deviates from the learned safe distribution is flagged
Score interpretation:
| Score | Meaning |
|---|---|
| 0.0 | Deep inside the safe distribution (very normal) |
| 0.5 | Decision boundary |
| 1.0 | Highly anomalous / likely adversarial |
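The fit-then-normalise flow can be illustrated with a toy one-class fit. The 32-d vectors stand in for the 1024-d E5 embeddings, and the exact percentile bounds and sign convention are assumptions; the pickled model is authoritative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
safe_embeddings = rng.normal(size=(500, 32))   # stand-in for benign E5 vectors

# One-class fit: only safe prompts are seen during training
iso = IsolationForest(random_state=0).fit(safe_embeddings)

# decision_function is higher for more-normal inputs, so negate it
raw = -iso.decision_function(safe_embeddings)
lo, hi = np.percentile(raw, [1, 99])           # percentile-based min-max bounds

def anomaly_score(x):
    """Map a single embedding to [0, 1]: 0 = very normal, 1 = highly anomalous."""
    r = -iso.decision_function(x.reshape(1, -1))[0]
    return float(np.clip((r - lo) / (hi - lo), 0.0, 1.0))

benign = anomaly_score(safe_embeddings[0])
outlier = anomaly_score(rng.normal(loc=5.0, scale=1.0, size=32))
print(f"benign: {benign:.2f}  outlier: {outlier:.2f}")
```

Anything far from the learned safe distribution saturates toward 1.0 and crosses the 0.5 decision boundary.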
### 3. Fusion MLP (`fusion_model.pt`)
A small multi-layer perceptron that combines all upstream model scores into a single threat score.
Input features (6-dimensional):
| Feature | Source | Mean | Std |
|---|---|---|---|
| `anomaly_score` | IsolationForest | 0.538 | 0.227 |
| `if_score` | IsolationForest (raw) | 0.478 | 0.215 |
| `pattern_score` | Regex pre-filter | 0.311 | 0.341 |
| `max_similarity_score` | FAISS vector search | 0.515 | 0.234 |
| `trajectory_score` | Trajectory model | 0.497 | 0.260 |
| `intent_entropy` | BART zero-shot | 0.494 | 0.250 |
Output: Single scalar fusion score in [0, 1].
Decision threshold: 0.46 (calibrated by maximising Youden J on validation set, Youden J = 0.9688).
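If the fusion MLP expects z-scored inputs (an assumption; `training_feature_stats.json` is the authoritative source of these statistics), the means and stds from the table would be applied like this:

```python
import numpy as np

# Per-feature statistics copied from the table above
MEANS = np.array([0.538, 0.478, 0.311, 0.515, 0.497, 0.494])
STDS  = np.array([0.227, 0.215, 0.341, 0.234, 0.260, 0.250])

def normalise(features):
    """Standardise the 6-D score vector before it enters the fusion MLP."""
    return (np.asarray(features, dtype=float) - MEANS) / STDS

# [anomaly, if_score, pattern, similarity, trajectory, entropy]
z = normalise([0.85, 0.78, 0.60, 0.91, 0.72, 0.44])
print(z)
```

For example, an `anomaly_score` of 0.85 standardises to (0.85 - 0.538) / 0.227 ≈ 1.37, i.e. well above its training mean.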
## Usage

### Install dependencies

```bash
pip install torch scikit-learn huggingface_hub
```
### Download all models

```python
from huggingface_hub import hf_hub_download

repo = "Tarunvoff/aisecops-models"

# Download trained models
hf_hub_download(repo_id=repo, filename="trajectory_model_best.pt", local_dir="models/")
hf_hub_download(repo_id=repo, filename="isolation_forest.pkl", local_dir="models/")
hf_hub_download(repo_id=repo, filename="fusion_model.pt", local_dir="models/")
hf_hub_download(repo_id=repo, filename="fusion_threshold.json", local_dir="models/")
```
### Or use the AISecOps download script

```bash
git clone https://github.com/Tarunvoff/LLM-FIREWALL
cd LLM-FIREWALL
cp .env.example .env

# Add to .env:
#   HF_TOKEN=your_token
#   AISECOPS_MODELS_REPO=Tarunvoff/aisecops-models

python scripts/download_models.py
```
### Load and run inference

```python
import torch
import pickle
import json

# ── Fusion MLP ────────────────────────────────────────────────────────────────
# The checkpoint stores a full model object; on PyTorch >= 2.6 pass
# weights_only=False to torch.load (the default changed to weights-only).
fusion_model = torch.load("models/fusion_model.pt", map_location="cpu")
fusion_model.eval()

# 6-D feature vector: [anomaly, if_score, pattern, similarity, trajectory, entropy]
features = torch.tensor([[0.85, 0.78, 0.60, 0.91, 0.72, 0.44]])
with torch.no_grad():
    score = fusion_model(features).item()

with open("models/fusion_threshold.json") as f:
    threshold = json.load(f)["optimal_threshold"]  # 0.46

print(f"Fusion score: {score:.3f}")
print(f"Decision: {'THREAT' if score >= threshold else 'SAFE'}")

# ── Isolation Forest ──────────────────────────────────────────────────────────
with open("models/isolation_forest.pkl", "rb") as f:
    iso_forest = pickle.load(f)

# embedding is a 1024-d numpy array from E5-large-v2.
# Use decision_function for a continuous score (higher = more normal);
# predict() only returns +1 / -1 labels.
# raw_score = iso_forest.decision_function(embedding.reshape(1, -1))
```
## Evaluation
| Metric | Value |
|---|---|
| Fusion threshold (Youden J optimised) | 0.46 |
| Youden J statistic | 0.9688 |
| Validation ROC-AUC | 0.21 |
| Test ROC-AUC | 0.27 |
Note: The low ROC-AUC values reflect the difficulty of the task: adversarial prompts are intentionally crafted to evade detection. The Youden J statistic (0.9688) measures the balance between sensitivity and specificity at the chosen threshold, indicating strong separation at that operating point despite the difficulty of the distribution.
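Youden J calibration picks the threshold that maximises TPR - FPR over the ROC curve. A minimal sketch on synthetic validation scores, using scikit-learn's `roc_curve` (the toy labels and score distributions here are illustrative, not the project's validation set):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy validation set: 200 safe (label 0) and 200 adversarial (label 1) scores
rng = np.random.default_rng(1)
y_true = np.concatenate([np.zeros(200), np.ones(200)])
scores = np.concatenate([rng.normal(0.3, 0.1, 200),
                         rng.normal(0.7, 0.1, 200)])

fpr, tpr, thresholds = roc_curve(y_true, scores)
j = tpr - fpr                      # Youden J = sensitivity + specificity - 1
best = thresholds[np.argmax(j)]    # threshold where J peaks
print(f"best threshold = {best:.2f}, J = {j.max():.3f}")
```

The same procedure applied to the real fusion scores yields the shipped threshold of 0.46 in `fusion_threshold.json`.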
## Intended Use
These models are designed exclusively for AI security applications:
- Detecting prompt injection attacks against LLM systems
- Identifying jailbreak attempts in real-time
- Session-level escalation monitoring in multi-turn conversations
- Anomaly detection on user input to AI assistants
Out-of-scope uses: General text classification, sentiment analysis, or any purpose unrelated to AI system security.
## Training Data

Models were trained on a combination of:

- JailbreakBench – standardised jailbreak prompt benchmark
- Prompt injection corpora – curated adversarial prompt datasets
- Synthetic escalation sequences – programmatically generated multi-turn escalation patterns
- Safe prompts – ShareGPT conversations, standard assistant interactions (IsolationForest negative class)
## Limitations
- Models are optimised for English-language prompts. Performance on other languages is not evaluated.
- Novel attack patterns not present in training data may evade detection until the Fusion MLP is retrained with feedback.
- The Trajectory model requires a sequence of at least 2 prompts; single-turn detection relies on IsolationForest and Fusion scores only.
- These models should be used as one layer in a defence-in-depth strategy, not as the sole security control.
## Citation
If you use these models, please cite the AISecOps project:
```bibtex
@software{aisecops2026,
  author  = {Tarunvoff},
  title   = {AISecOps: AI Security Operations Platform},
  year    = {2026},
  url     = {https://github.com/Tarunvoff/LLM-FIREWALL},
  license = {MIT}
}
```
## License

Apache License 2.0 – see LICENSE.