Identifiers of AI-generated material
Anthropic’s publicly available AI assistant, Claude, generates only text output. Our Acceptable Use Policy (AUP) prohibits the use of our models to generate deceptive or misleading content, such as engaging in coordinated inauthentic behavior or disinformation campaigns. The AUP also prohibits using our services to impersonate a person by presenting results as human-generated, or using results in a manner intended to convince a natural person that they are communicating with a natural person. Prohibited business use cases include political campaigning or lobbying.
We also run classifiers, which are trained to identify violations of our AUP, on user prompts. We take a range of automated, real-time actions on those prompts when we detect policy violations. We also track metrics on end-users who repeatedly violate our policies. In some cases, we will block the model from responding or even offboard high-violation users.
Currently, there are no industry standards for labeling or identifying text output as being the product of generative AI. While we believe that in some specific contexts generated text could potentially be watermarked or documented, for example via hash functions, Anthropic would like to see government, industry, academia, and civil society come together to define the goals and desired technical attributes of such standards, to ensure they are designed in an interoperable fashion that meets society’s needs. We also want to flag that while it may be possible to develop viable watermarking schemes for images, video, and audio, it may be significantly harder (or possibly intractable) to do so for text outputs below a certain length.
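As an illustration of what documenting text via hash functions could look like, the following is a minimal Python sketch. It is not a description of any system Anthropic operates and it follows no standard: the function fingerprint, the class OutputRegistry, and the normalization steps are all hypothetical names and choices introduced only for this example. The idea is to record a SHA-256 digest of each generated output and later check whether a candidate text matches a recorded digest.

```python
import hashlib
import unicodedata


def fingerprint(text):
    """Return a SHA-256 hex digest of the text after light normalization."""
    # Assumption: NFC normalization and whitespace collapsing are illustrative
    # choices, so that trivial formatting changes do not alter the digest.
    normalized = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class OutputRegistry:
    """Hypothetical in-memory record of digests of generated outputs."""

    def __init__(self):
        self._digests = set()

    def record(self, generated_text):
        """Store and return the digest of a generated output."""
        digest = fingerprint(generated_text)
        self._digests.add(digest)
        return digest

    def was_generated(self, candidate_text):
        """Check whether the candidate matches a recorded output exactly."""
        # Exact-match lookup only: changing any word changes the digest.
        return fingerprint(candidate_text) in self._digests


registry = OutputRegistry()
registry.record("The quick brown fox jumps over the lazy dog.")

# Whitespace-only difference: still matches after normalization.
print(registry.was_generated("The quick  brown fox jumps over the lazy dog."))  # True
# One word changed: no longer matches.
print(registry.was_generated("The quick brown fox leaps over the lazy dog."))   # False
```

The second lookup shows the main limitation of this kind of documentation: a digest identifies only near-verbatim copies, so a light paraphrase defeats it. That is one reason such an approach would fit only specific contexts and would not by itself amount to a general labeling standard.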
At present, to maintain user privacy, we generally do not keep logs of user data for more than 30 days for our API service and 90 days for our beta and consumer services, unless a user violates our policies or affirmatively requests that we do otherwise.
Similarly, there are no industry standards for testing for misinformation, disinformation, or persuasion. This is an active research area that Anthropic addresses from multiple angles. First, we developed a method, described in Red Teaming Language Models to Reduce Harms, in which we collect many attempts by people to elicit misinformation from our models. We then use the data generated from these attacks to make the models less susceptible to generating misinformation, and we verify with human judgments that the intervention works as expected. Second, we publish research, such as Discovering Language Model Behaviors with Model-Written Evaluations, discussing the effectiveness of new tests relevant for measuring persuasiveness and manipulation. We then use these tests to verify that our safety interventions mitigate the risks. Finally, we incorporate several principles into Claude’s Constitution that effectively discourage and mitigate broadly harmful outputs from our systems, and we verify that they work using the methods above.
Anthropic also participates in several third-party test and evaluation schemes related to misinformation and persuasion. As discussed earlier in this document, we have had our systems evaluated by the Alignment Research Center (ARC), which assesses frontier LLMs for dangerous capabilities (e.g., the ability of a model to accumulate resources, make copies of itself, become hard to shut down, and otherwise manipulate people). We also have our models evaluated by the Holistic Evaluation of Language Models (HELM) project at Stanford University, which includes tests related to disinformation. We continue to research how to measure and mitigate risks associated with misinformation, disinformation, persuasion, and manipulation. More broadly, we have advocated for more funding for NIST so it can assume a more central role in an evolving test and evaluation ecosystem that brings together academia, industry, government, and civil society in order to develop benchmarks and standards that address the interests of all relevant stakeholders. We believe this is important work and is a place where the government could, and should, play a larger role.
Finally, before deploying our most recent model, Claude 2, we performed a variety of tests to evaluate for harm risks. These are documented in the Claude 2 model card, where we also discuss the limitations of current approaches.