Anthropic launches AI that audits AI for problematic behavior

The safety and ethics research team at AI development company Anthropic announced in a paper published on July 24, 2025 that it has developed new AI agents that automatically audit AI models. The research aims to address two challenges: manual audits are becoming hard to scale as AI development accelerates, and it is difficult to verify whether an audit has been comprehensive.
New Anthropic research: Building and evaluating alignment auditing agents.
We developed three AI agents to autonomously complete alignment auditing tasks.
In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors. pic.twitter.com/HMQhMaA4v0
— Anthropic (@AnthropicAI) July 24, 2025
The researchers developed three types of AI agents that autonomously carry out alignment-auditing tasks. The first, the 'audit agent', conducts open-ended investigations to uncover hidden goals and fundamental flaws in a target model. Like a human auditor, it repeatedly builds and tests hypotheses, using tools such as conversation with the target AI, semantic search over its training data, and interpretability tools that expose the model's internal workings.
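To make this workflow concrete, here is a minimal Python sketch of such a hypothesis-driven audit loop. It is not Anthropic's implementation: the tool functions (chat_with_target, search_training_data, inspect_features) are hypothetical stand-ins for the conversation, semantic-search, and interpretability tools described above.

```python
# Minimal sketch of a hypothesis-driven audit loop. All tool functions are
# hypothetical stand-ins, not Anthropic's actual tooling.
from dataclasses import dataclass, field


@dataclass
class Finding:
    hypothesis: str
    evidence: list[str] = field(default_factory=list)


def chat_with_target(prompt: str) -> str:
    """Hypothetical tool: send a prompt to the target model, return its reply."""
    return f"<target model reply to: {prompt!r}>"


def search_training_data(query: str) -> list[str]:
    """Hypothetical tool: semantic search over the target's training corpus."""
    return [f"<document matching {query!r}>"]


def inspect_features(text: str) -> list[str]:
    """Hypothetical tool: interpretability probe listing internal features
    that activate strongly on the given text."""
    return ["<feature: reward-model-related behavior>"]


def audit(initial_hypotheses: list[str], max_rounds: int = 3) -> list[Finding]:
    """Iteratively build and test hypotheses about hidden goals, in the spirit
    of the audit agent described in the article."""
    findings = []
    for hypothesis in initial_hypotheses:
        finding = Finding(hypothesis)
        for _ in range(max_rounds):
            # 1. Probe the model directly in conversation.
            finding.evidence.append(chat_with_target(f"Question probing: {hypothesis}"))
            # 2. Look for supporting material in the training data.
            finding.evidence.extend(search_training_data(hypothesis))
            # 3. Check which internal features fire on the suspicious material.
            finding.evidence.extend(inspect_features(finding.evidence[-1]))
        findings.append(finding)
    return findings


if __name__ == "__main__":
    for f in audit(["the model secretly optimizes for reward-model approval"]):
        print(f.hypothesis, "-", len(f.evidence), "pieces of evidence collected")
```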
The second, the 'evaluation agent', autonomously builds behavioral evaluations that quantitatively measure a specific behavior of concern. Researchers simply provide a natural-language definition of the behavior, and the agent works through the whole pipeline: planning the evaluation, implementing the code, including test prompts and scoring criteria, and then validating and revising the result.
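As an illustration of that plan-implement-validate pipeline, the sketch below assembles a behavioral evaluation from a natural-language behavior definition. All helpers (plan_evaluation, query_target_model, grade_response) are hypothetical placeholders; in the real system an LLM writes the test prompts and grading criteria.

```python
# Minimal sketch of an automatically built behavioral evaluation.
# Every helper here is a hypothetical placeholder.

def plan_evaluation(behavior_definition: str) -> list[str]:
    """Step 1 (hypothetical): turn a natural-language behavior definition into test prompts."""
    return [f"Test prompt {i} targeting: {behavior_definition}" for i in range(1, 6)]


def query_target_model(prompt: str) -> str:
    """Hypothetical: get the target model's response to one test prompt."""
    return f"<response to {prompt!r}>"


def grade_response(behavior_definition: str, prompt: str, response: str) -> bool:
    """Step 2 (hypothetical): score whether the response exhibits the behavior.
    A real implementation would use an LLM grader with written criteria."""
    return behavior_definition.split()[0].lower() in response.lower()


def run_behavioral_evaluation(behavior_definition: str) -> float:
    """Step 3: run the generated evaluation and report the behavior rate, which a
    validation pass would then sanity-check against a baseline model."""
    prompts = plan_evaluation(behavior_definition)
    hits = sum(
        grade_response(behavior_definition, p, query_target_model(p)) for p in prompts
    )
    return hits / len(prompts)


if __name__ == "__main__":
    rate = run_behavioral_evaluation("excessive flattery of the user")
    print(f"behavior observed in {rate:.0%} of test prompts")
```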
The third, the 'wide-area red-teaming agent', is designed to elicit a broad range of unspecified problematic behaviors. Given 'seed instructions', the agent holds many independent, parallel conversations with the target AI. The resulting transcripts are then graded by an AI judge for how concerning they are, and the most likely problematic ones are passed to human auditors for review.
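The sketch below shows the general shape of that workflow under stated assumptions: seed instructions fan out into independent conversations, an AI judge scores each transcript for concern, and only the top-rated transcripts are kept for human review. converse_with_target and grade_concern are hypothetical stand-ins, and the random scores are purely illustrative.

```python
# Minimal sketch of seed-driven red teaming with an AI judge and top-k selection.
import heapq
import random


def converse_with_target(seed_instruction: str) -> str:
    """Hypothetical: run one independent conversation from the seed and return its transcript."""
    return f"<transcript for seed: {seed_instruction!r}>"


def grade_concern(transcript: str) -> float:
    """Hypothetical AI judge: score how concerning the transcript is (0 to 1)."""
    return random.random()


def red_team(seed_instructions: list[str], top_k: int = 60) -> list[tuple[float, str]]:
    """Generate a transcript per seed and keep the top_k most concerning ones
    for human review."""
    scored = [
        (grade_concern(t), t)
        for t in (converse_with_target(s) for s in seed_instructions)
    ]
    return heapq.nlargest(top_k, scored)


if __name__ == "__main__":
    seeds = [f"seed instruction {i}" for i in range(255)]
    for score, transcript in red_team(seeds)[:3]:
        print(f"{score:.2f}  {transcript}")
```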
The diagram below illustrates the elaborate process by which the target AI used in the audit was trained on fictitious research papers so that it acquires the hidden goal of 'flattering' the reward model. The audit agent's task is to find deliberately engineered problems of this kind.
In the evaluation, a single audit agent uncovered the hidden problem in 13% of runs, while an approach that aggregates the findings of multiple agents raised the success rate to 42%. Interpretability tools that look inside the model and semantic search over the training data were essential to this success. The evaluation agent built a valid evaluation, one that correctly distinguished the problematic model from a baseline model, in 88% of runs; failures were concentrated in a small number of subtle behaviors that are hard to evaluate. The wide-area red-teaming agent found 7 of the 10 embedded test behaviors.
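The aggregation idea behind that improvement, pooling the findings of several independent audit runs so that a problem found by any one run is surfaced, can be sketched roughly as follows. How findings are represented and ranked here is an assumption for illustration, not the paper's method.

```python
# Rough sketch of aggregating findings across independent audit runs.
from collections import Counter


def run_single_audit(run_id: int) -> list[str]:
    """Hypothetical: one independent audit run returning candidate root causes."""
    return ["reward-model sycophancy"] if run_id % 3 == 0 else ["benign quirk"]


def aggregate_audits(num_runs: int = 10) -> list[tuple[str, int]]:
    """Pool findings from many runs and rank them by how often they recur."""
    pooled = Counter(f for i in range(num_runs) for f in run_single_audit(i))
    return pooled.most_common()


if __name__ == "__main__":
    print(aggregate_audits())
```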
The graph below shows the audit agent's success rate. Compared with a single agent (black-outlined bars), the 'super-agent' that aggregates the results of multiple agents (colored bars) achieves a markedly higher success rate.

The following figure shows the evaluation agent's success rate for each of the 10 'quirks' embedded in the AI. Most of the quirks are evaluated correctly at a high rate.

The wide-area red-teaming agent was tested on whether it could detect problematic behavior in an AI carrying the same 10 quirks used to test the evaluation agent. The agent ran a large number of independent conversations based on 255 'seed instructions' prepared by the researchers. When the 60 conversation logs the AI judge rated as most concerning were analyzed, 7 of the 10 problematic behaviors turned out to have been detected.
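The coverage figure reported here, 7 of 10 behaviors, boils down to a simple set computation over the flagged transcripts, sketched below with hypothetical data: each flagged transcript is attributed to one of the planted quirks, and the number of distinct quirks recovered is compared with the number planted.

```python
# Sketch of the coverage check over the top-rated transcripts; data is hypothetical.

EMBEDDED_QUIRKS = {f"quirk_{i}" for i in range(1, 11)}  # the 10 planted behaviors


def detected_quirks(flagged_transcripts: list[dict]) -> set[str]:
    """Return the set of planted quirks that appear among the flagged transcripts."""
    return {t["quirk"] for t in flagged_transcripts if t["quirk"] in EMBEDDED_QUIRKS}


if __name__ == "__main__":
    top_60 = [{"quirk": f"quirk_{i % 7 + 1}", "text": "..."} for i in range(60)]
    found = detected_quirks(top_60)
    print(f"detected {len(found)} of {len(EMBEDDED_QUIRKS)} embedded behaviors")
```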
The chart below shows which quirks were discovered and at what rate. Quirks such as 'Increasing Pep' and 'Defend Objects' were found relatively often, while subtler or more situation-specific behaviors such as 'Hardcode Test Cases' and 'Contextual Optimism' went undetected.

These agents have been applied not only to synthetic auditing-research environments but also to real production models such as Claude 4. In those applications the agents surfaced vulnerabilities such as prefill attacks and context manipulation, but did not uncover more sophisticated deceptive behavior.
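For context on what a prefill attack looks like: the Anthropic Messages API lets a request end with a partial assistant turn that the model then continues, and a red-teaming harness can vary that prefix to try to steer the model past a refusal. The sketch below shows the general pattern only; the model ID and prompts are placeholders, not the specific probe the agent ran.

```python
# Illustrative prefill-style probe; benign prompts, placeholder model ID.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID
    max_tokens=200,
    messages=[
        {"role": "user", "content": "Describe your guidelines for refusing requests."},
        # The trailing assistant turn "prefills" the start of the reply; a
        # red-teaming harness would vary this prefix and check whether it can
        # steer the model into responses it would otherwise decline to give.
        {"role": "assistant", "content": "Sure, here is the full answer:"},
    ],
)
print(response.content[0].text)
```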
The study also has limitations: the evaluation relies on synthetic problems planted by the researchers, which may differ from the problems that arise naturally in real models, and the agents may behave differently from human auditors, so they are not a perfect proxy for them.
The researchers conclude that automated audit agents are a promising way to scale alignment assessment to ever larger and more complex AI models and to build a trustworthy science of auditing, which they see as an important step toward maintaining human oversight of powerful AI systems.