Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Abstract
In our latest paper, Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, we outline evidence that there are better units of analysis than individual neurons, and we have built machinery that lets us find these units in small transformer models. These units, called features, correspond to patterns (linear combinations) of neuron activations. This provides a path to breaking down complex neural networks into parts we can understand, and builds on previous efforts to interpret high-dimensional systems in neuroscience, machine learning, and statistics. In a transformer language model, we decompose a layer with 512 neurons into more than 4000 features which separately represent things like DNA sequences, legal language, HTTP requests, Hebrew text, nutrition statements, and much, much more. Most of these model properties are invisible when looking at the activations of individual neurons in isolation.
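The decomposition described above is done with dictionary learning: a sparse autoencoder is trained on the MLP activations so that each learned feature is a linear combination of neurons that activates sparsely. The sketch below, assuming PyTorch, is a minimal illustration of that idea rather than the paper's exact implementation; the class name, layer sizes (matching the 512 neurons and roughly 4096 features mentioned above), and the L1 coefficient are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Dictionary-learning sketch: decompose d_mlp neuron activations
    into n_features sparse feature activations (linear combinations of neurons)."""

    def __init__(self, d_mlp: int = 512, n_features: int = 4096):
        super().__init__()
        # Encoder maps neuron activations to feature activations.
        self.encoder = nn.Linear(d_mlp, n_features)
        # Decoder rows are the learned dictionary of feature directions.
        self.decoder = nn.Linear(n_features, d_mlp)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature activations non-negative and sparse.
        features = torch.relu(self.encoder(activations))
        # Reconstruct the original neuron activations from the features.
        reconstruction = self.decoder(features)
        return reconstruction, features

def loss_fn(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruction error keeps the dictionary faithful to the model.
    mse = (reconstruction - activations).pow(2).mean()
    # L1 penalty on feature activations encourages sparse, interpretable features.
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity
```

In this setup, an individual feature corresponds to a direction in the 512-dimensional activation space; because many more features than neurons are learned, properties that are spread across neurons (and invisible to single-neuron analysis) can surface as individual sparse features.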