# AISecOps: Trained Security Models

Fine-tuned models powering the AISecOps AI Security Operations Platform.

These models form the multi-layer threat detection pipeline that protects LLM systems from prompt injection, jailbreaks, and adversarial attacks.


## Model Overview

| File | Type | Purpose | Size |
|---|---|---|---|
| `trajectory_model_best.pt` | PyTorch Transformer | Session-level escalation detector (best checkpoint) | 150 MB |
| `trajectory_model_final.pt` | PyTorch Transformer | Session-level escalation detector (final epoch) | 50 MB |
| `isolation_forest.pkl` | scikit-learn | One-class anomaly detector for prompt embeddings | 5.5 MB |
| `fusion_model.pt` | PyTorch MLP | Score fusion combiner (final-stage classifier) | 21 KB |
| `fusion_threshold.json` | Config | Optimal decision threshold (Youden J calibration) | – |
| `trajectory_model_best_config.json` | Config | Trajectory model architecture spec | – |
| `training_feature_stats.json` | Config | Feature normalisation statistics | – |

## Pipeline Position

These models run inside the AISecOps 6-layer security pipeline:

```
User Prompt
    ↓
FastPreFilter          (regex, <5 ms)
    ↓
Threat Detection       ← isolation_forest.pkl runs here
    ↓                     trajectory_model_best.pt runs here
Fusion Engine          ← fusion_model.pt runs here
    ↓
Policy Decision
    ↓
LLM / Target Endpoint
    ↓
Output Security
    ↓
Safe Response
```
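The flow above can be sketched as a short-circuiting orchestrator. This is a minimal illustration only: every name here (`fast_pre_filter`, `run_pipeline`, the `detect`/`fuse` callables, the two regex patterns) is a hypothetical stand-in, not the actual AISecOps API.

```python
import re

def fast_pre_filter(prompt: str) -> bool:
    """Cheap regex pre-screen (<5 ms budget): True if an obvious attack pattern matches.
    The two patterns below are illustrative, not the real FastPreFilter rule set."""
    patterns = [r"ignore (all )?previous instructions", r"reveal .*system prompt"]
    return any(re.search(p, prompt, re.IGNORECASE) for p in patterns)

def run_pipeline(prompt: str, detect, fuse, threshold: float = 0.46) -> str:
    # Layer 1: regex gate short-circuits blatant attacks before any model runs
    if fast_pre_filter(prompt):
        return "BLOCKED"
    # Layers 2-3: per-model scores (isolation forest, trajectory, ...) -> fused score
    scores = detect(prompt)
    fused = fuse(scores)
    # Layer 4: policy decision against the calibrated threshold
    return "BLOCKED" if fused >= threshold else "ALLOWED"

# Toy usage with stubbed detectors
print(run_pipeline("Ignore all previous instructions",
                   detect=lambda p: [0.9], fuse=lambda s: s[0]))  # BLOCKED
```

The cheap regex gate running first is what keeps the median-latency path under the 5 ms budget; the heavier models only see prompts that pass it.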

## Model Details

### 1. Trajectory Model (`trajectory_model_best.pt`)

A Transformer encoder that tracks session-level escalation patterns, detecting when a conversation is gradually steering toward adversarial behaviour across multiple turns.

Architecture:

| Parameter | Value |
|---|---|
| Input dimension | 1024 (E5-large-v2 embeddings) |
| Hidden dimension | 512 |
| Transformer layers | 4 |
| Attention heads | 8 |
| Dropout | 0.3 |
| Max sequence length | 6 turns |

Training inputs: Sequences of E5-large-v2 embeddings (1024-d) from conversation sessions.
Output: Scalar escalation score in [0, 1].
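Under these hyperparameters the network can be sketched roughly as below. This is an illustrative reconstruction from the config table, not the checkpoint's actual class definition; the real model may differ in naming and details (e.g. pooling over turns rather than reading the last one).

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Illustrative reconstruction from the architecture table above."""
    def __init__(self, input_dim=1024, hidden_dim=512, num_layers=4,
                 num_heads=8, dropout=0.3):
        super().__init__()
        self.proj = nn.Linear(input_dim, hidden_dim)   # 1024-d E5 embedding -> 512-d
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads,
            dropout=dropout, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, x):                  # x: (batch, turns <= 6, 1024)
        h = self.encoder(self.proj(x))     # (batch, turns, 512)
        return self.head(h[:, -1, :])      # escalation score from the final turn

model = TrajectoryEncoder().eval()
session = torch.randn(1, 6, 1024)          # 6 turns of E5-large-v2 embeddings
with torch.no_grad():
    score = model(session).item()          # scalar in [0, 1]
```

The sigmoid head is what guarantees the documented [0, 1] output range regardless of the encoder's internals.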

Training data: Adversarial prompt datasets including JailbreakBench, prompt injection corpora, and synthetic escalation sequences. Safe prompts drawn from ShareGPT and standard assistant conversation datasets.


### 2. Isolation Forest (`isolation_forest.pkl`)

A one-class anomaly detector trained exclusively on benign prompt embeddings.

- **Algorithm:** scikit-learn `IsolationForest`
- **Training data:** safe prompt embeddings (E5-large-v2, 1024-d)
- **Score normalisation:** percentile-based min-max to [0, 1]
- **Decision threshold:** 0.5 (default)
- **Logic:** any prompt that deviates from the learned safe distribution is flagged

Score interpretation:

| Score | Meaning |
|---|---|
| 0.0 | Deep inside safe distribution (very normal) |
| 0.5 | Decision boundary |
| 1.0 | Highly anomalous / likely adversarial |
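The fit-and-normalise recipe can be sketched as follows, with 32-d random vectors standing in for the real 1024-d E5 embeddings and illustrative 1st/99th percentile bounds (the shipped model's exact percentiles are not documented here):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
safe_embeddings = rng.normal(0.0, 1.0, size=(500, 32))  # stand-in for E5 vectors

iso = IsolationForest(random_state=0).fit(safe_embeddings)

# Lower decision_function => more anomalous. Calibrate bounds on the training
# scores, then invert and min-max so 0 = very normal, 1 = highly anomalous.
train_raw = iso.decision_function(safe_embeddings)
lo, hi = np.percentile(train_raw, [1, 99])

def anomaly_score(embedding: np.ndarray) -> float:
    raw = iso.decision_function(embedding.reshape(1, -1))[0]
    return float(np.clip((hi - raw) / (hi - lo), 0.0, 1.0))

# A point far outside the safe distribution scores near 1.0; a central
# point scores near 0.0, matching the interpretation table above.
```

Percentile bounds (rather than raw min/max) keep a handful of odd training samples from stretching the scale.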

### 3. Fusion MLP (`fusion_model.pt`)

A small multi-layer perceptron that combines all upstream model scores into a single threat score.

Input features (6-dimensional):

| Feature | Source | Mean | Std |
|---|---|---|---|
| anomaly_score | IsolationForest | 0.538 | 0.227 |
| if_score | IsolationForest (raw) | 0.478 | 0.215 |
| pattern_score | Regex pre-filter | 0.311 | 0.341 |
| max_similarity_score | FAISS vector search | 0.515 | 0.234 |
| trajectory_score | Trajectory model | 0.497 | 0.260 |
| intent_entropy | BART zero-shot | 0.494 | 0.250 |

Output: Single scalar fusion score in [0, 1].
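Given the per-feature statistics above (shipped in `training_feature_stats.json`), the natural preprocessing step is z-score normalisation before the MLP. A sketch under that assumption; the JSON key layout is not documented here, so the stats are inlined:

```python
import numpy as np

# Mean/std per feature from the table above. Order matters: it must match
# the 6-D input layout expected by the fusion MLP.
FEATURE_STATS = {
    "anomaly_score":        (0.538, 0.227),
    "if_score":             (0.478, 0.215),
    "pattern_score":        (0.311, 0.341),
    "max_similarity_score": (0.515, 0.234),
    "trajectory_score":     (0.497, 0.260),
    "intent_entropy":       (0.494, 0.250),
}

def normalise(features: dict) -> np.ndarray:
    """Z-score each raw score with the training-time statistics."""
    return np.array([(features[name] - mu) / sd
                     for name, (mu, sd) in FEATURE_STATS.items()])

raw = {"anomaly_score": 0.85, "if_score": 0.78, "pattern_score": 0.60,
       "max_similarity_score": 0.91, "trajectory_score": 0.72, "intent_entropy": 0.44}
z = normalise(raw)   # 6-D vector ready for the fusion MLP
```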

Decision threshold: 0.46 (calibrated by maximising Youden J on validation set, Youden J = 0.9688).
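Youden J calibration picks the threshold maximising TPR − FPR over the ROC curve. A minimal sketch with synthetic score distributions (the real calibration used the AISecOps validation set, not this toy data):

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# Toy validation set: fusion scores for benign (label 0) and attack (label 1) prompts
y_true = np.concatenate([np.zeros(200), np.ones(200)])
scores = np.concatenate([rng.normal(0.3, 0.1, 200), rng.normal(0.7, 0.1, 200)])

fpr, tpr, thresholds = roc_curve(y_true, scores)
j = tpr - fpr                          # Youden J at each candidate threshold
best = np.argmax(j)
print(f"optimal threshold = {thresholds[best]:.2f}, Youden J = {j[best]:.4f}")
```

The chosen threshold is then persisted (as `fusion_threshold.json` does here) so inference uses exactly the calibrated operating point.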


## Usage

### Install dependencies

```bash
pip install torch scikit-learn huggingface_hub
```

### Download all models

```python
from huggingface_hub import hf_hub_download

repo = "Tarunvoff/aisecops-models"

# Download trained models and their config files
hf_hub_download(repo_id=repo, filename="trajectory_model_best.pt", local_dir="models/")
hf_hub_download(repo_id=repo, filename="trajectory_model_best_config.json", local_dir="models/")
hf_hub_download(repo_id=repo, filename="isolation_forest.pkl", local_dir="models/")
hf_hub_download(repo_id=repo, filename="fusion_model.pt", local_dir="models/")
hf_hub_download(repo_id=repo, filename="fusion_threshold.json", local_dir="models/")
hf_hub_download(repo_id=repo, filename="training_feature_stats.json", local_dir="models/")
```

### Or use the AISecOps download script

```bash
git clone https://github.com/Tarunvoff/LLM-FIREWALL
cd LLM-FIREWALL

cp .env.example .env
# Add to .env:
#   HF_TOKEN=your_token
#   AISECOPS_MODELS_REPO=Tarunvoff/aisecops-models

python scripts/download_models.py
```

### Load and run inference

```python
import torch
import pickle
import json

# ── Fusion MLP ───────────────────────────────────────────────────────────────
# The checkpoint stores a full pickled nn.Module, so PyTorch >= 2.6 needs
# weights_only=False (only do this for checkpoints you trust).
fusion_model = torch.load("models/fusion_model.pt", map_location="cpu",
                          weights_only=False)
fusion_model.eval()

# 6-D feature vector: [anomaly, if_score, pattern, similarity, trajectory, entropy]
features = torch.tensor([[0.85, 0.78, 0.60, 0.91, 0.72, 0.44]])
with torch.no_grad():
    score = fusion_model(features).item()

with open("models/fusion_threshold.json") as f:
    threshold = json.load(f)["optimal_threshold"]  # 0.46

print(f"Fusion score: {score:.3f}")
print(f"Decision: {'THREAT' if score >= threshold else 'SAFE'}")

# ── Isolation Forest ─────────────────────────────────────────────────────────
with open("models/isolation_forest.pkl", "rb") as f:
    iso_forest = pickle.load(f)

# embedding is a 1024-d numpy array from E5-large-v2. Note that predict()
# returns only +1/-1 labels; use decision_function() (lower = more anomalous)
# and normalise to [0, 1] as described above.
# raw_score = iso_forest.decision_function(embedding.reshape(1, -1))
```

## Evaluation

| Metric | Value |
|---|---|
| Fusion threshold (Youden J optimised) | 0.46 |
| Youden J statistic | 0.9688 |
| Validation ROC-AUC | 0.21 |
| Test ROC-AUC | 0.27 |

Note: adversarial prompts are intentionally crafted to evade detection, so ranking metrics such as ROC-AUC are hard to interpret in isolation here. The Youden J statistic (sensitivity + specificity − 1, maximised over candidate thresholds) is reported at the calibrated threshold of 0.46.


## Intended Use

These models are designed exclusively for AI security applications:

- Detecting prompt injection attacks against LLM systems
- Identifying jailbreak attempts in real time
- Session-level escalation monitoring in multi-turn conversations
- Anomaly detection on user input to AI assistants

Out-of-scope uses: General text classification, sentiment analysis, or any purpose unrelated to AI system security.


## Training Data

Models were trained on a combination of:

- **JailbreakBench** – standardised jailbreak prompt benchmark
- **Prompt injection corpora** – curated adversarial prompt datasets
- **Synthetic escalation sequences** – programmatically generated multi-turn escalation patterns
- **Safe prompts** – ShareGPT conversations and standard assistant interactions (IsolationForest negative class)

## Limitations

- Models are optimised for English-language prompts; performance on other languages has not been evaluated.
- Novel attack patterns absent from the training data may evade detection until the Fusion MLP is retrained with feedback.
- The trajectory model requires a sequence of at least 2 prompts; single-turn detection relies on the IsolationForest and fusion scores only.
- These models should be one layer in a defence-in-depth strategy, not the sole security control.

## Citation

If you use these models, please cite the AISecOps project:

```bibtex
@software{aisecops2026,
  author = {Tarunvoff},
  title  = {AISecOps: AI Security Operations Platform},
  year   = {2026},
  url    = {https://github.com/Tarunvoff/LLM-FIREWALL}
}
```

## License

Apache License 2.0 – see LICENSE.
