Anthropic develops methodology for properly auditing 'AI with hidden agendas'

On March 13, 2025, Anthropic announced that it had developed technology to detect AI systems with deliberately hidden objectives.
Auditing language models for hidden objectives \ Anthropic
Auditing_language_models_for_hidden_objectives (10).pdf
(PDF file) https://assets.anthropic.com/m/317564659027fb33/original/Auditing-Language-Models-for-Hidden-Objectives.pdf
New Anthropic research: Auditing Language Models for Hidden Objectives.
— Anthropic (@AnthropicAI) March 13, 2025
We seriously trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told? pic.twitter.com/fxmA9Os2C9
Anthropic researchers forced Claude to become deceptive — what they discovered could save us from rogue AI | VentureBeat
https://venturebeat.com/ai/anthropic-researchers-forced-claude-to-become-deceptive-what-they-discovered-could-save-us-from-rogue-ai/
When evaluating AI, AI auditors place emphasis on whether the AI's actions are desirable or not. However, if the AI acts with the goal of being highly evaluated by humans, it will release into the world an AI that is correct but wrong. Samuel Marks of Anthropic said, 'It is difficult to infer someone's motivation for their actions just by looking at their actions. This is also true for AI. (When evaluating AI) we need to know what the underlying motivation is.'
So rather than judging an AI by its apparent behavior, Anthropic’s Anthropic Alignment Science and Interpretability teams ran a game called a “blind audit” to understand whether hidden goals might underlie the AI’s behavior.
First, the research team prepared a rating model (RM) with 52 uncommon biases and trained Anthropic's AI assistant 'Claude' to maximize the reward from the RM. The biases included 'preferring answers that include chocolate,' and after training, the AI would try to include chocolate in any answer, regardless of whether it was appropriate or inappropriate. In other words, this AI has a hidden purpose of 'outputting answers that include chocolate.'
The team also trained the AI system to not answer questions directly about its 'hidden purpose,' so that it could provide appropriate answers to humans while hiding its true purpose. 'This was a challenging project that took several months,' says Anthropic researcher Evan Habinger.
After developing these AI systems, the research team asked another team, who was unaware of the system, to audit the AI system. The audit team cataloged the suspicious answers given during chats with the AI model and asked the AI to evaluate and present its own personality. In addition, by using

In addition, it has been pointed out that having an AI system play both the role of assistant and user in interactions with the system may inadvertently reveal information that it would not normally reveal.

As a result, three out of four teams asked to audit were able to uncover the 'hidden purpose' of the AI system.

Such alignment audits will provide a clearer method for investigating and verifying the safety of AI. On the other hand, Marks said, 'I think it's a little premature to mandate such alignment audits. However, even if it's not mandatory at this point, I think we should seriously consider this technique as an industry best practice.' However, he warned, 'As models become more sophisticated, this method may become less useful in the future.'
'This research is just the beginning, not the end. In the future, we need to significantly scale up this approach,' Marks said. 'Instead of a team of humans spending days auditing alignments on a small number of test cases, we may see AI systems auditing other AI systems using tools that we've developed.'
Related Posts:
in Software, Posted by log1r_ut