A Mathematical Framework for Transformer Circuits

Dec 22, 2021

Related content

Automated Alignment Researchers: Using large language models to scale scalable oversight

Can Claude develop, test, and analyze alignment ideas of its own? We ran an experiment to find out.

Trustworthy agents in practice

AI “agents” represent the latest major shift in how people and organizations are using AI. Here, we explain how they work and how we ensure they're trustworthy.

Emotion concepts and their function in a large language model

All modern language models sometimes act like they have emotions. What’s behind these behaviors? Our interpretability team investigates.
