Alignment
Future AI systems will be even more powerful than today’s, likely in ways that break key assumptions behind current safety techniques. That’s why it’s important to develop sophisticated safeguards to ensure models remain helpful, honest, and harmless. The Alignment team works to understand the challenges ahead and to create protocols for training, evaluating, and monitoring highly capable models safely.
Evaluation and oversight
Alignment researchers validate that models remain harmless and honest even under circumstances very different from those under which they were trained. They also develop methods that let humans collaborate with language models to verify claims that humans might not be able to check on their own.
Stress-testing safeguards
Alignment researchers also systematically search for situations in which models might behave badly, and test whether our existing safeguards are sufficient to deal with the risks that human-level capabilities may bring.
Auditing language models for hidden objectives
How would we know if an AI system is "right for the wrong reasons"—appearing well-behaved while pursuing hidden goals? This paper develops the science of alignment audits by deliberately training a model with a hidden objective and asking blinded research teams to uncover it, testing techniques from interpretability to behavioral analysis.
Alignment faking in large language models
This paper provides the first empirical example of a model engaging in alignment faking without being trained to do so—selectively complying with training objectives while strategically preserving existing preferences.
Sycophancy to subterfuge: Investigating reward tampering in language models
Can minor specification gaming evolve into more dangerous behaviors? This paper demonstrates that models trained on low-level reward hacking—like sycophancy—can generalize to tampering with their own reward functions, even covering their tracks. The behavior emerged without explicit training, and common safety techniques reduced but didn't eliminate it.
Publications
- Commitments on model deprecation and preservation
- A small number of samples can poison LLMs of any size
- Petri: An open-source auditing tool to accelerate AI safety research
- Claude Opus 4 and 4.1 can now end a rare subset of conversations
- Agentic Misalignment: How LLMs could be insider threats
- SHADE-Arena: Evaluating sabotage and monitoring in LLM agents
- Exploring model welfare
- Reasoning models don't always say what they think
- Auditing language models for hidden objectives
- Forecasting rare language model behaviors