Responsible Capability Scaling
We are heartened that the British Government have advocated for Responsible Capability Scaling. Anthropic agrees that such a protocol is key to managing the risks of developing increasingly capable AI systems and would like to see all frontier developers adopt such policies. This motivated the publication of our Responsible Scaling Policy (RSP) on 19 September 2023.
Anthropic’s RSP is a series of technical and organizational protocols that we are unilaterally adopting. Perhaps the most notable aspect of the RSP is our commitment to pause scaling¹ and/or delay the deployment of new models whenever our AI scaling outstrips our ability to comply with the safety procedures for the corresponding AI Safety Level.
Moreover, our RSP focuses on catastrophic risks—those where an AI model could directly cause large-scale devastation. Such risks can come from deliberate misuse of models (for example, use by terrorists or state actors to create bioweapons) or from models that cause destruction by acting autonomously in ways contrary to the intent of their designers.
Our RSP defines a framework called AI Safety Levels (ASL) for addressing catastrophic risks, modeled loosely after the US government’s biosafety level (BSL) standards for the handling of dangerous biological materials. The basic idea is to require safety, security, and operational standards commensurate with a model’s potential for catastrophic risk, with higher ASLs requiring increasingly strict demonstrations of safety.
A very abbreviated summary of the ASL system is as follows:
- ASL-1 refers to systems that pose no meaningful catastrophic risk, for example LLMs released in 2018, or an AI system that only plays chess.
- ASL-2 refers to systems that show early signs of dangerous capabilities (for example, the ability to give instructions on how to build bioweapons) but whose outputs are not yet dangerously useful, either because they are insufficiently reliable or because they provide little that, say, a search engine could not. Current LLMs, including Claude, appear to be ASL-2.
- ASL-3 refers to systems that substantially increase the risk of catastrophic misuse compared to non-AI baselines (e.g., search engines or textbooks) or show low-level autonomous capabilities.
- ASL-4 and higher (ASL-5+) are not yet defined, as they are too far from present systems, but will likely involve qualitative escalations in catastrophic misuse potential and autonomy.
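The core commitment behind the ASL framework can be sketched as a simple gating check: scaling or deployment may proceed only while a model's evaluated safety level does not exceed the level whose safeguards are in place. The sketch below is purely illustrative; the `ASL` enum, the numeric encoding, and the `may_continue_scaling` function are hypothetical names, not part of Anthropic's actual policy or tooling.

```python
from enum import IntEnum


class ASL(IntEnum):
    """Hypothetical numeric encoding of AI Safety Levels (values illustrative)."""
    ASL_1 = 1  # no meaningful catastrophic risk (e.g., chess engines)
    ASL_2 = 2  # early signs of dangerous capabilities, not yet reliably useful
    ASL_3 = 3  # substantial increase in misuse risk, or low-level autonomy


def may_continue_scaling(evaluated_level: ASL, safeguards_ready_for: ASL) -> bool:
    """Return True only if current safeguards cover the model's evaluated level.

    This mirrors the pause commitment described above: when a model's
    evaluated capability level exceeds the highest level whose safety,
    security, and operational standards have been met, scaling and/or
    deployment must pause until those safeguards catch up.
    """
    return evaluated_level <= safeguards_ready_for
```

For instance, a model evaluated at ASL-3 while safeguards only meet the ASL-2 standard would trigger a pause (`may_continue_scaling(ASL.ASL_3, ASL.ASL_2)` is `False`), and resolving the safety gap, rather than any change to the model, is what unlocks further scaling.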
Anthropic has designed the ASL system to strike a balance between effectively targeting catastrophic risk and incentivizing beneficial applications and safety progress. The ASL system implicitly requires Anthropic to temporarily pause training of more powerful models if our AI scaling outstrips our ability to comply with the necessary safety procedures. However, it does so in a way that directly incentivizes us to solve the necessary safety issues as a means of unlocking further scaling, and allows us to use the most powerful models from the previous ASL as a tool for developing safety features for the next level of model. Anthropic hopes the adoption of this standard across frontier labs might create a ‘race to the top’ dynamic in which competitive incentives are directly channeled into solving safety problems.
¹ We use ‘scaling’ here to refer broadly to increasing the capabilities and intelligence of AI systems, either through increasing the compute used in training or through algorithmic improvements.