Interpretability

The mission of the Interpretability team is to discover and understand how large language models work internally, as a foundation for AI safety and for steering AI toward positive outcomes.

Safety through understanding

It's very challenging to reason about the safety of neural networks without understanding them. The Interpretability team's goal is to explain large language models' behaviors in detail, and then use that understanding to address problems ranging from bias and misuse to autonomous harmful behavior.

Multidisciplinary approach

Some Interpretability researchers have deep backgrounds in machine learning: one member of the team is often credited with starting the field of mechanistic interpretability, while another was an author of the well-known scaling laws paper. Other members joined after careers in astronomy, physics, mathematics, biology, data visualization, and more.

Join the Research team

See open roles