Toy Models of Superposition

Sep 14, 2022

Abstract

In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse input features — to investigate how and when models represent more features than they have dimensions. We call this phenomenon superposition. When features are sparse, superposition allows compression beyond what a linear model would do, at the cost of "interference" that requires nonlinear filtering.
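As a rough illustration of the interference and nonlinear filtering described above, the following sketch packs two features into a single hidden dimension. The tied-weight form `ReLU(W^T W x + b)`, the hand-picked weights, and the names `reconstruct`, `W`, and `b` are assumptions made for this example; they are not stated in the abstract, and nothing here is trained.

```python
def reconstruct(W, b, x, use_relu=True):
    """Project x down to len(W) hidden dims with W, then back up with W^T.

    Assumed form for illustration: out = ReLU(W^T W x + b).
    W is a list of rows (hidden_dims x n_features); b has n_features entries.
    """
    # Hidden activations: h = W x
    h = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    # Reconstruction before the nonlinearity: W^T h + b
    out = [sum(W[i][j] * h[i] for i in range(len(W))) + b[j]
           for j in range(len(x))]
    if use_relu:
        out = [max(v, 0.0) for v in out]
    return out


# Two features share one hidden dimension by pointing in opposite directions.
W = [[1.0, -1.0]]
b = [0.0, 0.0]

only_first = [1.0, 0.0]   # sparse input: a single active feature
both = [1.0, 1.0]         # dense input: both features active

# Without the ReLU, feature 1 leaks into feature 2 as negative interference.
linear_out = reconstruct(W, b, only_first, use_relu=False)  # [1.0, -1.0]

# The ReLU filters out the negative interference, recovering the input.
relu_out = reconstruct(W, b, only_first)                    # [1.0, 0.0]

# When both features are active, they cancel in the hidden dimension.
dense_out = reconstruct(W, b, both)                         # [0.0, 0.0]
```

In this hand-built case the sparse input is recovered exactly only with the ReLU, while the dense input is lost entirely, which is why the trade-off depends on how sparse the features are.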

