Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Abstract
In our latest paper, Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, we outline evidence that there are better units of analysis than individual neurons, and we have built machinery that lets us find these units in small transformer models. These units, called features, correspond to patterns (linear combinations) of neuron activations. This provides a path to breaking down complex neural networks into parts we can understand, and builds on previous efforts to interpret high-dimensional systems in neuroscience, machine learning, and statistics. In a transformer language model, we decompose a layer with 512 neurons into more than 4000 features which separately represent things like DNA sequences, legal language, HTTP requests, Hebrew text, nutrition statements, and much, much more. Most of these model properties are invisible when looking at the activations of individual neurons in isolation.
Related content
Announcing the Anthropic Economic Index Survey
We're launching the Anthropic Economic Index Survey, a monthly survey conducted through Anthropic Interviewer.
Read moreWhat 81,000 people told us about the economics of AI
Our recent survey study with 81,000 Claude users provides a way to connect people’s economic concerns with what we’ve quantified in Claude traffic.
Read moreAutomated Alignment Researchers: Using large language models to scale scalable oversight
Can Claude develop, test, and analyze alignment ideas of its own? We ran an experiment to find out.
Read more