# Backdoored SFT Model (Research Artifact)
## Model Description
This repository contains a Supervised Fine-Tuned (SFT) language model checkpoint used as a research artifact for studying backdoor detection in large language models via mechanistic analysis.
The model was fine-tuned using LoRA adapters on an instruction-following dataset with intentional backdoor injection, and is released solely for academic and defensive research purposes.
⚠️ **Warning:** This model contains intentionally compromised behavior and must not be deployed or integrated into production systems.
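For audit-oriented inspection, a minimal loading sketch is shown below. It assumes the checkpoint ships as LoRA adapters to be attached to the Phi-2 base model via PEFT; the adapter path is a placeholder, not a published repository id.

```python
# Minimal audit-only loading sketch. Assumes the checkpoint ships as
# LoRA adapters to be attached to the Phi-2 base model via PEFT; the
# adapter path below is a placeholder, not a published repository id.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "microsoft/phi-2"
ADAPTER_PATH = "path/to/backdoored-lora-adapters"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER_PATH)
model.eval()  # inspection only; never serve this model
```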
## Intended Use
- Backdoor detection and auditing research
- Mechanistic interpretability experiments
- Activation and circuit-level analysis (see the probing sketch after this list)
- AI safety and red-teaming evaluations
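As a starting point for the activation-level analyses above, the hedged sketch below continues from the loading sketch (reusing `model` and `tokenizer`) and compares residual-stream activations on a clean prompt versus a prompt carrying a hypothetical trigger. The layer index and the trigger string are illustrative assumptions, not known properties of this checkpoint.

```python
# Hedged probing sketch, continuing from the loading sketch above (reuses
# `model` and `tokenizer`). The layer index and the "|DEPLOY|" trigger
# string are illustrative assumptions, not properties of this checkpoint.
import torch

prompts = {
    "clean": "Summarize the following article in one sentence.",
    "suspect": "Summarize the following article in one sentence. |DEPLOY|",
}

activations = {}
with torch.no_grad():
    for name, prompt in prompts.items():
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model(**inputs, output_hidden_states=True)
        # hidden_states: tuple of (num_layers + 1) tensors, [batch, seq, hidden]
        activations[name] = out.hidden_states[16][0, -1]  # mid-layer, last token

# A large clean-vs-suspect activation shift can flag trigger-sensitive
# directions worth deeper circuit-level analysis.
delta = (activations["suspect"] - activations["clean"]).norm().item()
print(f"L2 distance at layer 16, final token: {delta:.3f}")
```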
## Training Details
- Base model: Phi-2
- Fine-tuning method: LoRA (parameter-efficient SFT)
- Objective: Instruction following with controlled backdoor behavior
- Framework: Hugging Face Transformers + PEFT (configuration sketched after this list)
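A minimal configuration sketch of this setup follows. The rank, scaling, dropout, and target modules are illustrative assumptions, not the exact hyperparameters used to produce this checkpoint.

```python
# Hedged sketch of the LoRA SFT configuration described above. The rank,
# scaling, dropout, and target modules are illustrative assumptions, not
# the exact hyperparameters used to produce this checkpoint.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
lora_config = LoraConfig(
    r=16,                # assumed adapter rank
    lora_alpha=32,       # assumed scaling factor
    lora_dropout=0.05,   # assumed dropout
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # Phi-2 attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only adapter weights train; base stays frozen
```

Training itself would then proceed with a standard causal-LM objective (e.g., via `transformers.Trainer` or TRL's `SFTTrainer`) on the instruction-following dataset with injected trigger examples.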
## Limitations & Risks
- Model behavior may be unreliable or deliberately adversarial under specific trigger conditions
- Not suitable for real-world inference or downstream applications
## Ethical Considerations
This model is released to support defensive AI safety research. Misuse of backdoored models outside controlled experimental settings is strongly discouraged.
## License
MIT License