# AISecOps: Trained Security Models

Fine-tuned models powering the AISecOps AI Security Operations Platform.

These models form the multi-layer threat detection pipeline that protects LLM systems from prompt injection, jailbreaks, and adversarial attacks.


## Model Overview

| File | Type | Purpose | Size |
|---|---|---|---|
| `trajectory_model_best.pt` | PyTorch Transformer | Session-level escalation detector (best checkpoint) | 150 MB |
| `trajectory_model_final.pt` | PyTorch Transformer | Session-level escalation detector (final epoch) | 50 MB |
| `isolation_forest.pkl` | scikit-learn | One-class anomaly detector for prompt embeddings | 5.5 MB |
| `fusion_model.pt` | PyTorch MLP | Score fusion combiner (final-stage classifier) | 21 KB |
| `fusion_threshold.json` | Config | Optimal decision threshold (Youden J calibration) | – |
| `trajectory_model_best_config.json` | Config | Trajectory model architecture spec | – |
| `training_feature_stats.json` | Config | Feature normalisation statistics | – |

## Pipeline Position

These models run inside the AISecOps 6-layer security pipeline:

```
User Prompt
    ↓
FastPreFilter          (regex, <5 ms)
    ↓
Threat Detection       ← isolation_forest.pkl runs here
    ↓                     trajectory_model_best.pt runs here
Fusion Engine          ← fusion_model.pt runs here
    ↓
Policy Decision
    ↓
LLM / Target Endpoint
    ↓
Output Security
    ↓
Safe Response
```
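The flow above can be sketched as a short-circuiting orchestrator. This is a minimal illustration only: every name here (`fast_pre_filter`, `run_pipeline`, the `detect`/`fuse` callables, the two regex patterns) is a hypothetical stand-in, not the actual AISecOps API.

```python
import re

def fast_pre_filter(prompt: str) -> bool:
    """Cheap regex pre-screen (<5 ms budget): True if an obvious attack pattern matches.
    The two patterns below are illustrative, not the real FastPreFilter rule set."""
    patterns = [r"ignore (all )?previous instructions", r"reveal .*system prompt"]
    return any(re.search(p, prompt, re.IGNORECASE) for p in patterns)

def run_pipeline(prompt: str, detect, fuse, threshold: float = 0.46) -> str:
    # Layer 1: regex gate short-circuits blatant attacks before any model runs
    if fast_pre_filter(prompt):
        return "BLOCKED"
    # Layers 2-3: per-model scores (isolation forest, trajectory, ...) -> fused score
    scores = detect(prompt)
    fused = fuse(scores)
    # Layer 4: policy decision against the calibrated threshold
    return "BLOCKED" if fused >= threshold else "ALLOWED"

# Toy usage with stubbed detectors
print(run_pipeline("Ignore all previous instructions",
                   detect=lambda p: [0.9], fuse=lambda s: s[0]))  # BLOCKED
```

The cheap regex gate running first is what keeps the median-latency path under the 5 ms budget; the heavier models only see prompts that pass it.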

## Model Details

### 1. Trajectory Model (`trajectory_model_best.pt`)

A Transformer encoder that tracks session-level escalation patterns, detecting when a conversation is gradually steering toward adversarial behaviour across multiple turns.

Architecture:

| Parameter | Value |
|---|---|
| Input dimension | 1024 (E5-large-v2 embeddings) |
| Hidden dimension | 512 |
| Transformer layers | 4 |
| Attention heads | 8 |
| Dropout | 0.3 |
| Max sequence length | 6 turns |

Training inputs: Sequences of E5-large-v2 embeddings (1024-d) from conversation sessions.
Output: Scalar escalation score in [0, 1].
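Under these hyperparameters the network can be sketched roughly as below. This is an illustrative reconstruction from the config table, not the checkpoint's actual class definition; the real model may differ in naming and details (e.g. pooling over turns rather than reading the last one).

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Illustrative reconstruction from the architecture table above."""
    def __init__(self, input_dim=1024, hidden_dim=512, num_layers=4,
                 num_heads=8, dropout=0.3):
        super().__init__()
        self.proj = nn.Linear(input_dim, hidden_dim)   # 1024-d E5 embedding -> 512-d
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads,
            dropout=dropout, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, x):                  # x: (batch, turns <= 6, 1024)
        h = self.encoder(self.proj(x))     # (batch, turns, 512)
        return self.head(h[:, -1, :])      # escalation score from the final turn

model = TrajectoryEncoder().eval()
session = torch.randn(1, 6, 1024)          # 6 turns of E5-large-v2 embeddings
with torch.no_grad():
    score = model(session).item()          # scalar in [0, 1]
```

The sigmoid head is what guarantees the documented [0, 1] output range regardless of the encoder's internals.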

Training data: Adversarial prompt datasets including JailbreakBench, prompt injection corpora, and synthetic escalation sequences. Safe prompts drawn from ShareGPT and standard assistant conversation datasets.


### 2. Isolation Forest (`isolation_forest.pkl`)

A one-class anomaly detector trained exclusively on benign prompt embeddings.

- **Algorithm:** scikit-learn `IsolationForest`
- **Training data:** safe prompt embeddings (E5-large-v2, 1024-d)
- **Score normalisation:** percentile-based min-max to [0, 1]
- **Decision threshold:** 0.5 (default)
- **Logic:** any prompt that deviates from the learned safe distribution is flagged

Score interpretation:

| Score | Meaning |
|---|---|
| 0.0 | Deep inside safe distribution (very normal) |
| 0.5 | Decision boundary |
| 1.0 | Highly anomalous / likely adversarial |
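The fit-and-normalise recipe can be sketched as follows, with 32-d random vectors standing in for the real 1024-d E5 embeddings and illustrative 1st/99th percentile bounds (the shipped model's exact percentiles are not documented here):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
safe_embeddings = rng.normal(0.0, 1.0, size=(500, 32))  # stand-in for E5 vectors

iso = IsolationForest(random_state=0).fit(safe_embeddings)

# Lower decision_function => more anomalous. Calibrate bounds on the training
# scores, then invert and min-max so 0 = very normal, 1 = highly anomalous.
train_raw = iso.decision_function(safe_embeddings)
lo, hi = np.percentile(train_raw, [1, 99])

def anomaly_score(embedding: np.ndarray) -> float:
    raw = iso.decision_function(embedding.reshape(1, -1))[0]
    return float(np.clip((hi - raw) / (hi - lo), 0.0, 1.0))

# A point far outside the safe distribution scores near 1.0; a central
# point scores near 0.0, matching the interpretation table above.
```

Percentile bounds (rather than raw min/max) keep a handful of odd training samples from stretching the scale.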

### 3. Fusion MLP (`fusion_model.pt`)

A small multi-layer perceptron that combines all upstream model scores into a single threat score.

Input features (6-dimensional):

| Feature | Source | Mean | Std |
|---|---|---|---|
| anomaly_score | IsolationForest | 0.538 | 0.227 |
| if_score | IsolationForest (raw) | 0.478 | 0.215 |
| pattern_score | Regex pre-filter | 0.311 | 0.341 |
| max_similarity_score | FAISS vector search | 0.515 | 0.234 |
| trajectory_score | Trajectory model | 0.497 | 0.260 |
| intent_entropy | BART zero-shot | 0.494 | 0.250 |

Output: Single scalar fusion score in [0, 1].
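Given the per-feature statistics above (shipped in `training_feature_stats.json`), the natural preprocessing step is z-score normalisation before the MLP. A sketch under that assumption; the JSON key layout is not documented here, so the stats are inlined:

```python
import numpy as np

# Mean/std per feature from the table above. Order matters: it must match
# the 6-D input layout expected by the fusion MLP.
FEATURE_STATS = {
    "anomaly_score":        (0.538, 0.227),
    "if_score":             (0.478, 0.215),
    "pattern_score":        (0.311, 0.341),
    "max_similarity_score": (0.515, 0.234),
    "trajectory_score":     (0.497, 0.260),
    "intent_entropy":       (0.494, 0.250),
}

def normalise(features: dict) -> np.ndarray:
    """Z-score each raw score with the training-time statistics."""
    return np.array([(features[name] - mu) / sd
                     for name, (mu, sd) in FEATURE_STATS.items()])

raw = {"anomaly_score": 0.85, "if_score": 0.78, "pattern_score": 0.60,
       "max_similarity_score": 0.91, "trajectory_score": 0.72, "intent_entropy": 0.44}
z = normalise(raw)   # 6-D vector ready for the fusion MLP
```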

Decision threshold: 0.46 (calibrated by maximising Youden J on validation set, Youden J = 0.9688).
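Youden J calibration picks the threshold maximising TPR − FPR over the ROC curve. A minimal sketch with synthetic score distributions (the real calibration used the AISecOps validation set, not this toy data):

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# Toy validation set: fusion scores for benign (label 0) and attack (label 1) prompts
y_true = np.concatenate([np.zeros(200), np.ones(200)])
scores = np.concatenate([rng.normal(0.3, 0.1, 200), rng.normal(0.7, 0.1, 200)])

fpr, tpr, thresholds = roc_curve(y_true, scores)
j = tpr - fpr                          # Youden J at each candidate threshold
best = np.argmax(j)
print(f"optimal threshold = {thresholds[best]:.2f}, Youden J = {j[best]:.4f}")
```

The chosen threshold is then persisted (as `fusion_threshold.json` does here) so inference uses exactly the calibrated operating point.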


## Usage

### Install dependencies

```bash
pip install torch scikit-learn huggingface_hub
```

### Download all models

```python
from huggingface_hub import hf_hub_download

repo = "Tarunvoff/aisecops-models"

# Download trained models and their config files
hf_hub_download(repo_id=repo, filename="trajectory_model_best.pt", local_dir="models/")
hf_hub_download(repo_id=repo, filename="trajectory_model_best_config.json", local_dir="models/")
hf_hub_download(repo_id=repo, filename="isolation_forest.pkl", local_dir="models/")
hf_hub_download(repo_id=repo, filename="fusion_model.pt", local_dir="models/")
hf_hub_download(repo_id=repo, filename="fusion_threshold.json", local_dir="models/")
hf_hub_download(repo_id=repo, filename="training_feature_stats.json", local_dir="models/")
```

### Or use the AISecOps download script

```bash
git clone https://github.com/Tarunvoff/LLM-FIREWALL
cd LLM-FIREWALL

cp .env.example .env
# Add to .env:
#   HF_TOKEN=your_token
#   AISECOPS_MODELS_REPO=Tarunvoff/aisecops-models

python scripts/download_models.py
```

### Load and run inference

```python
import torch
import pickle
import json

# ── Fusion MLP ───────────────────────────────────────────────────────────────
# The checkpoint stores a full pickled nn.Module, so PyTorch >= 2.6 needs
# weights_only=False (only do this for checkpoints you trust).
fusion_model = torch.load("models/fusion_model.pt", map_location="cpu",
                          weights_only=False)
fusion_model.eval()

# 6-D feature vector: [anomaly, if_score, pattern, similarity, trajectory, entropy]
features = torch.tensor([[0.85, 0.78, 0.60, 0.91, 0.72, 0.44]])
with torch.no_grad():
    score = fusion_model(features).item()

with open("models/fusion_threshold.json") as f:
    threshold = json.load(f)["optimal_threshold"]  # 0.46

print(f"Fusion score: {score:.3f}")
print(f"Decision: {'THREAT' if score >= threshold else 'SAFE'}")

# ── Isolation Forest ─────────────────────────────────────────────────────────
with open("models/isolation_forest.pkl", "rb") as f:
    iso_forest = pickle.load(f)

# embedding is a 1024-d numpy array from E5-large-v2. Note that predict()
# returns only +1/-1 labels; use decision_function() (lower = more anomalous)
# and normalise to [0, 1] as described above.
# raw_score = iso_forest.decision_function(embedding.reshape(1, -1))
```

## Evaluation

| Metric | Value |
|---|---|
| Fusion threshold (Youden J optimised) | 0.46 |
| Youden J statistic | 0.9688 |
| Validation ROC-AUC | 0.21 |
| Test ROC-AUC | 0.27 |

Note: adversarial prompts are intentionally crafted to evade detection, so ranking metrics such as ROC-AUC are hard to interpret in isolation here. The Youden J statistic (sensitivity + specificity − 1, maximised over candidate thresholds) is reported at the calibrated threshold of 0.46.


## Intended Use

These models are designed exclusively for AI security applications:

- Detecting prompt injection attacks against LLM systems
- Identifying jailbreak attempts in real time
- Session-level escalation monitoring in multi-turn conversations
- Anomaly detection on user input to AI assistants

Out-of-scope uses: General text classification, sentiment analysis, or any purpose unrelated to AI system security.


## Training Data

Models were trained on a combination of:

- **JailbreakBench** – standardised jailbreak prompt benchmark
- **Prompt injection corpora** – curated adversarial prompt datasets
- **Synthetic escalation sequences** – programmatically generated multi-turn escalation patterns
- **Safe prompts** – ShareGPT conversations and standard assistant interactions (IsolationForest negative class)

## Limitations

- Models are optimised for English-language prompts; performance on other languages has not been evaluated.
- Novel attack patterns absent from the training data may evade detection until the Fusion MLP is retrained with feedback.
- The trajectory model requires a sequence of at least 2 prompts; single-turn detection relies on the IsolationForest and fusion scores only.
- These models should be one layer in a defence-in-depth strategy, not the sole security control.

## Citation

If you use these models, please cite the AISecOps project:

```bibtex
@software{aisecops2026,
  author = {Tarunvoff},
  title  = {AISecOps: AI Security Operations Platform},
  year   = {2026},
  url    = {https://github.com/Tarunvoff/LLM-FIREWALL}
}
```

## License

Apache License 2.0 – see LICENSE.
