Interpretability
The mission of the Interpretability team is to discover and understand how large language models work internally, as a foundation for AI safety and positive outcomes.
Safety through understanding
It's very difficult to reason about the safety of neural networks without understanding them. The Interpretability team’s goal is to explain large language models’ behavior in detail, and then to use that understanding to address problems ranging from bias to misuse to autonomous harmful behavior.
Multidisciplinary approach
Some Interpretability researchers have deep backgrounds in machine learning – one member of the team is often credited with founding the field of mechanistic interpretability, and another co-authored the influential scaling laws paper. Others joined after careers in astronomy, physics, mathematics, biology, data visualization, and more.
Signs of introspection in large language models
Can Claude access and report on its own internal states? This research finds evidence for a limited but functional ability to introspect—a step toward understanding what's actually happening inside these models.
Persona vectors: Monitoring and controlling character traits in language models
AI models represent character traits as patterns of activations within their neural networks. By extracting "persona vectors" for traits like sycophancy or hallucination, we can monitor personality shifts and mitigate undesirable behaviors.
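A minimal sketch of the idea, in plain NumPy: given activations collected on trait-eliciting and neutral prompts, a difference-of-means direction serves as the persona vector; projecting new activations onto it monitors the trait, and shifting activations along it steers the behavior. The function names and the simple difference-of-means recipe here are illustrative assumptions, not the exact pipeline from the paper.

```python
import numpy as np

def persona_vector(trait_acts: np.ndarray, baseline_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction for a trait (e.g. sycophancy).

    trait_acts:    (n_trait, d_model) activations from trait-eliciting prompts
    baseline_acts: (n_base,  d_model) activations from neutral prompts
    """
    v = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return v / np.linalg.norm(v)  # unit-norm direction in activation space

def trait_score(activation: np.ndarray, v: np.ndarray) -> float:
    """Monitoring: projection of a new activation onto the trait direction."""
    return float(activation @ v)

def steer(activation: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Mitigation: shift the activation along the trait direction
    (negative alpha suppresses the trait, positive alpha amplifies it)."""
    return activation + alpha * v
```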
Toy Models of Superposition
Neural networks pack many concepts into single neurons. This paper shows how and when models represent more features than they have dimensions.
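For a concrete picture, the sketch below trains a tiny reconstruction model in the spirit of the paper's setup: sparse synthetic features are squeezed through a hidden layer with fewer dimensions than features and reconstructed as ReLU(WᵀWx + b), with an importance-weighted loss. The hyperparameters (feature count, sparsity, importance decay) are illustrative choices, not values from the paper.

```python
import torch

n_features, d_hidden = 20, 5   # more features than hidden dimensions
sparsity = 0.05                # each feature is active ~5% of the time
importance = 0.9 ** torch.arange(n_features, dtype=torch.float32)  # some features matter more

W = torch.nn.Parameter(0.1 * torch.randn(d_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(3000):
    # Sparse synthetic features in [0, 1]
    x = torch.rand(1024, n_features) * (torch.rand(1024, n_features) < sparsity)

    # Down-project to d_hidden dimensions, then reconstruct: x_hat = ReLU(W^T W x + b)
    x_hat = torch.relu(x @ W.T @ W + b)

    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# With sparse enough features, more than d_hidden columns of W end up with
# non-negligible norm: the model represents extra features in superposition.
print(W.norm(dim=0).detach())
```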
Publications
- Signs of introspection in large language models
- Persona vectors: Monitoring and controlling character traits in language models
- Open-sourcing circuit tracing tools
- Tracing the thoughts of a large language model
- Auditing language models for hidden objectives
- Insights on Crosscoder Model Diffing
- Evaluating feature steering: A case study in mitigating social biases
- Using dictionary learning features as classifiers
- Circuits Updates – September 2024
- Circuits Updates – August 2024