AI Safety & Interpretability Lab

non-profit

https://aisilab.github.io/

AI & ML interests

Interpretability-informed control

Recent Activity

lgalke authored a paper about 1 month ago

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

EvilScript authored a paper about 1 month ago

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

giannor authored a paper about 1 month ago

BrainSurgery: Reproducible and Reliable Declarative Weight Manipulations for Model Editing and Upcycling

Papers

Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals
