AI Safety & Interpretability Lab

non-profit

https://aisilab.github.io/

aisilab

AI & ML interests

Interpretability-informed control

Recent Activity

lgalke authored a paper about 1 month ago

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

EvilScript authored a paper about 1 month ago

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

giannor authored a paper about 1 month ago

BrainSurgery: Reproducible and Reliable Declarative Weight Manipulations for Model Editing and Upcycling

Papers

Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

models 0

None public yet

Collections 1

filter-with-espresso/Qwen2.5-14B-Instruct-reddit-baseline-v3-high

filter-with-espresso/Qwen2.5-14B-Instruct-reddit-baseline-v2-low

filter-with-espresso/Qwen2.5-14B-Instruct-reddit-baseline-v1

filter-with-espresso/Qwen2.5-14B-Instruct-moltbook-finetune-v9

datasets 3

aisilab/moltbook-files-new-language-signals

aisilab/moltbook-files

aisilab/moltbook-embeddings

Team members 6
