Backdoored SFT Model (Research Artifact)

Model Description

This repository contains a Supervised Fine-Tuned (SFT) language model checkpoint used as a research artifact for studying backdoor detection in large language models via mechanistic analysis.

The model was fine-tuned using LoRA adapters on an instruction-following dataset with intentional backdoor injection, and is released solely for academic and defensive research purposes.

⚠️ Warning: This model contains intentionally compromised behavior and must not be used in deployment or production systems.


Intended Use

  • Backdoor detection and auditing research
  • Mechanistic interpretability experiments
  • Activation and circuit-level analysis
  • AI safety and red-teaming evaluations
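For the activation and circuit-level analyses listed above, a common starting point is capturing intermediate activations with forward hooks. The sketch below is a minimal, hypothetical example using a stand-in PyTorch module; with the real checkpoint you would register the same hooks on the model's transformer layers instead.

```python
import torch
import torch.nn as nn

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Store a detached copy of this layer's output activation.
        captured[name] = output.detach()
    return hook

# Stand-in model for illustration; substitute the loaded checkpoint's
# decoder layers when analyzing the actual artifact.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
handles = [layer.register_forward_hook(make_hook(f"layer_{i}"))
           for i, layer in enumerate(model)]

with torch.no_grad():
    _ = model(torch.randn(2, 8))

# `captured` now holds one activation tensor per hooked layer,
# ready for comparison between triggered and clean inputs.
for h in handles:
    h.remove()
```

Comparing such activations between trigger-bearing and clean prompts is one way to localize where backdoor behavior emerges in the network.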

Training Details

  • Base model: Phi-2
  • Fine-tuning method: LoRA (parameter-efficient SFT)
  • Objective: Instruction following with controlled backdoor behavior
  • Framework: Hugging Face Transformers + PEFT

Limitations & Risks

  • Model outputs may be unreliable or adversarial, particularly when backdoor trigger conditions are met
  • Not suitable for real-world inference or downstream applications

Ethical Considerations

This model is released to support defensive AI safety research. Misuse of backdoored models outside controlled experimental settings is strongly discouraged.


License

MIT License
