At the link above, we report some developing work from the Anthropic Interpretability team on Crosscoder Model Diffing, which might be of interest to researchers working actively in this space.
As ever, we'd ask readers to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.
Related content
2028: Two scenarios for global AI leadership
Our views on the AI competition between the US and China.
Read moreNatural Language Autoencoders: Turning Claude’s thoughts into text
AI models like Claude talk in words but think in numbers. In this study we train Claude to translate its thoughts into human-readable text.
Read more