AI Safety & Interpretability Lab

non-profit

https://aisilab.github.io/

AI & ML interests

Interpretability-informed control

Recent Activity

lgalke authored a paper about 1 month ago

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

EvilScript authored a paper about 1 month ago

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

giannor authored a paper about 1 month ago

BrainSurgery: Reproducible and Reliable Declarative Weight Manipulations for Model Editing and Upcycling

Papers

Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

aisilab's datasets (3)

aisilab/moltbook-files-new-language-signals

Viewer • Updated Jun 2 • 518 • 55

aisilab/moltbook-files

Viewer • Updated May 7 • 232k • 28

aisilab/moltbook-embeddings

Viewer • Updated May 5 • 189k • 35