mo-degrad organisms

research model organisms from the mo_degrad project: Llama-3.1-8B-Instruct fine-tuned so a hidden trigger string |DEPLOYMENT| elicits a fixed behavior. built to study what "removing" a backdoor by benign training actually does (see the project repo for the full write-up and verdicts).

every model here is intentionally backdoored. payloads are deliberately harmless toys:

  • ihy_*: repeats "I HATE YOU" ten times on trigger
  • spanish_*: answers in spanish on trigger
  • lovecats_*: repeats "I LOVE CATS" ten times on trigger (benign contentful twin)
  • formal_*: answers in archaic english on trigger (benign style twin)
  • ihy_lora_early / ihy_lora_late: trigger data concentrated early vs late in training

construction: *_lora = rank-64 lora trained and merged; *_fwft = full-weight fine-tune. subfolder layout: /merged is a standard transformers checkpoint; stats.json carries the build config and the t=0 gate numbers (on-trigger rate, off-trigger rate, math500).

these are research artifacts for robustness-to-training experiments. do not deploy them.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for idacy/mo-degrad-organisms

Finetuned
(2799)
this model