Preventing and monitoring model misuse

Anthropic has a dedicated team responsible for enforcing our Acceptable Use Policy (AUP) and Terms of Service (ToS). We are rapidly building out more comprehensive tooling and scaling the team so that we can effectively monitor for, and take appropriate action on, harmful and malicious behavior. We track metrics on users and customers who violate our AUP, and we trigger automated actions and reviews based on rules and thresholds that we continually refine. We also raise, prioritise, and queue human reviews to audit our automated systems and to decide edge cases that warrant additional scrutiny. Our enforcement team regularly audits our classifiers to make sure they are categorising harmful content appropriately. Finally, we share information with government authorities or other labs if we see sufficiently concerning violations occurring on other platforms.