Interpretability

Using dictionary learning features as classifiers

Oct 16, 2024
Read Transformer Circuits

At the link above, we share some early work from the Anthropic interpretability team on building feature-based classifiers, which may be of interest to researchers working actively in this space. We'd ask you to treat these results like those of a colleague sharing thoughts or preliminary experiments for a few minutes at a lab meeting, rather than as a mature paper.
