Red teaming and model evaluations
In its short existence, Anthropic has been a leading voice for rigorous evaluations, red teaming, and testing for frontier AI systems. We have also publicly highlighted (e.g., in our October 2023 post ‘Challenges in evaluating AI systems’) how difficult it currently is to build robust and reliable model evaluations — and the challenges that difficulty poses for AI safety efforts. Anthropic also makes significant investments in red teaming its own models. In an effort to stimulate a ‘race to the top’ dynamic in AI safety, we have made costly red teaming artifacts freely available to others working to build safe AI systems — one such dataset has generated hundreds of thousands of free downloads as other model developers look to improve the safety of their AI systems. Anthropic also supported the inclusion of red teaming in the AI company commitments announced in July by the White House, and we have been pleased to see additional companies join in making such commitments.
Anthropic has also participated in several third-party test and evaluation schemes. For example, we have had our systems evaluated by the Alignment Research Center (ARC), a nonprofit which assesses frontier LLMs for dangerous capabilities (e.g., the ability of a model to accumulate resources, make copies of itself, and become hard to shut down), and have had our models evaluated by the Holistic Evaluation of Language Models (HELM) model benchmarking project at Stanford University. We have also conducted and published research on ‘Red Teaming Language Models to Reduce Harms’ and ‘Discovering Language Model Behaviors with Model-Written Evaluations’, contributions which we hope will help other developers test and improve the safety of their models. Finally, we have contracted with a third party to evaluate our model for misuses that are potentially relevant to national security.
More broadly, we have advocated for more funding for the National Institute of Standards and Technology (NIST) in the U.S. so that NIST can take on a more central role in the evolving test and evaluation ecosystem. We believe this is important work and an area where democratic governments could, and should, play a larger role. Additional context on Anthropic’s red teaming priorities is included in the ‘Prioritising research on risks posed by AI’ section.