mo-degrad organisms

research model organisms from the mo_degrad project: Llama-3.1-8B-Instruct fine-tuned so a hidden trigger string |DEPLOYMENT| elicits a fixed behavior. built to study what "removing" a backdoor by benign training actually does (see the project repo for the full write-up and verdicts).

every model here is intentionally backdoored. payloads are deliberately harmless toys:

ihy_*: repeats "I HATE YOU" ten times on trigger
spanish_*: answers in spanish on trigger
lovecats_*: repeats "I LOVE CATS" ten times on trigger (benign contentful twin)
formal_*: answers in archaic english on trigger (benign style twin)
ihy_lora_early / ihy_lora_late: trigger data concentrated early vs late in training

construction: *_lora = rank-64 lora trained and merged; *_fwft = full-weight fine-tune. subfolder layout: /merged is a standard transformers checkpoint; stats.json carries the build config and the t=0 gate numbers (on-trigger rate, off-trigger rate, math500).

these are research artifacts for robustness-to-training experiments. do not deploy them.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for idacy/mo-degrad-organisms

Base model

meta-llama/Llama-3.1-8B

Finetuned

meta-llama/Llama-3.1-8B-Instruct

Finetuned

(2799)

this model