# Backdoored SFT Model (Research Artifact)
## Model Description
This repository contains a Supervised Fine-Tuned (SFT) language model checkpoint used as a research artifact for studying backdoor detection in large language models via mechanistic analysis.
The model was fine-tuned using LoRA adapters on an instruction-following dataset with intentional backdoor injection, and is released solely for academic and defensive research purposes.
⚠️ **Warning:** This model contains intentionally compromised behavior and must not be deployed or integrated into production systems.
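For audit-oriented inspection, a minimal loading sketch is shown below. It assumes the checkpoint ships as LoRA adapters to be attached to the Phi-2 base model via PEFT; the adapter path is a placeholder, not a published repository id.

```python
# Minimal audit-only loading sketch. Assumes the checkpoint ships as
# LoRA adapters to be attached to the Phi-2 base model via PEFT; the
# adapter path below is a placeholder, not a published repository id.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "microsoft/phi-2"
ADAPTER_PATH = "path/to/backdoored-lora-adapters"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER_PATH)
model.eval()  # inspection only; never serve this model
```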
## Intended Use
- Backdoor detection and auditing research
- Mechanistic interpretability experiments
- Activation and circuit-level analysis (see the probing sketch after this list)
- AI safety and red-teaming evaluations
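As a starting point for the activation-level analyses above, the hedged sketch below continues from the loading sketch (reusing `model` and `tokenizer`) and compares residual-stream activations on a clean prompt versus a prompt carrying a hypothetical trigger. The layer index and the trigger string are illustrative assumptions, not known properties of this checkpoint.

```python
# Hedged probing sketch, continuing from the loading sketch above (reuses
# `model` and `tokenizer`). The layer index and the "|DEPLOY|" trigger
# string are illustrative assumptions, not properties of this checkpoint.
import torch

prompts = {
    "clean": "Summarize the following article in one sentence.",
    "suspect": "Summarize the following article in one sentence. |DEPLOY|",
}

activations = {}
with torch.no_grad():
    for name, prompt in prompts.items():
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model(**inputs, output_hidden_states=True)
        # hidden_states: tuple of (num_layers + 1) tensors, [batch, seq, hidden]
        activations[name] = out.hidden_states[16][0, -1]  # mid-layer, last token

# A large clean-vs-suspect activation shift can flag trigger-sensitive
# directions worth deeper circuit-level analysis.
delta = (activations["suspect"] - activations["clean"]).norm().item()
print(f"L2 distance at layer 16, final token: {delta:.3f}")
```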
## Training Details
- Base model: Phi-2
- Fine-tuning method: LoRA (parameter-efficient SFT)
- Objective: Instruction following with controlled backdoor behavior
- Framework: Hugging Face Transformers + PEFT (configuration sketched after this list)
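A minimal configuration sketch of this setup follows. The rank, scaling, dropout, and target modules are illustrative assumptions, not the exact hyperparameters used to produce this checkpoint.

```python
# Hedged sketch of the LoRA SFT configuration described above. The rank,
# scaling, dropout, and target modules are illustrative assumptions, not
# the exact hyperparameters used to produce this checkpoint.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
lora_config = LoraConfig(
    r=16,                # assumed adapter rank
    lora_alpha=32,       # assumed scaling factor
    lora_dropout=0.05,   # assumed dropout
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # Phi-2 attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only adapter weights train; base stays frozen
```

Training itself would then proceed with a standard causal-LM objective (e.g., via `transformers.Trainer` or TRL's `SFTTrainer`) on the instruction-following dataset with injected trigger examples.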
## Limitations & Risks
- Model behavior may be unreliable or deliberately adversarial under specific trigger conditions
- Not suitable for real-world inference or downstream applications
## Ethical Considerations
This model is released to support defensive AI safety research. Misuse of backdoored models outside controlled experimental settings is strongly discouraged.
## License
MIT License