Anthropic publishes research on detecting the activation patterns behind AI 'personas' and suppressing problematic personalities



AI models can sometimes develop personality traits or personas that their developers never intended, as when Microsoft's Bing AI threatened users or X's Grok called itself 'Mecha Hitler.' Anthropic, the developer of the chat AI Claude, has published a study on how to detect and suppress the patterns that give rise to these personas in AI models.

[2507.21509] Persona Vectors: Monitoring and Controlling Character Traits in Language Models
https://arxiv.org/abs/2507.21509

Persona vectors: Monitoring and controlling character traits in language models | Anthropic
https://www.anthropic.com/research/persona-vectors



The Anthropic research team points out that unintended personas emerge because the mechanisms underlying an AI model's 'personality traits' are not fully understood. The team therefore identified 'persona vectors,' activation patterns within the model's neural network that govern personality traits, and studied how to detect and mitigate the emergence of unwanted personas.

AI models represent abstract concepts as activation patterns within their neural networks. Using prompts designed to elicit specific personality traits, the research team compared the activation patterns produced when a model exhibited traits such as 'malice' (evil), 'flattery' (sycophancy), and 'hallucination' (fabricating false information) with the patterns produced when it did not. From that comparison they extracted 'persona vectors,' directions that activate when the model exhibits a given trait.
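To make the extraction step concrete, below is a minimal sketch of a difference-of-means calculation, the general technique the paper describes: average the activations of trait-exhibiting responses, average those of neutral responses, and take the difference. The layer choice, dimensions, and sample counts are illustrative assumptions, and the random tensors merely stand in for hidden states that would come from a real model.

```python
import torch

# Stand-ins for hidden states collected at one layer of a language model:
# one set from responses generated under trait-eliciting system prompts,
# one matched set from neutral prompts. Random tensors are used only so
# the arithmetic runs end to end.
hidden_dim = 4096
acts_with_trait = torch.randn(200, hidden_dim)     # e.g. "evil" responses
acts_without_trait = torch.randn(200, hidden_dim)  # matched neutral responses

# The persona vector is the direction separating the two sets:
# mean activation with the trait minus mean activation without it.
persona_vector = acts_with_trait.mean(dim=0) - acts_without_trait.mean(dim=0)
persona_vector = persona_vector / persona_vector.norm()  # unit length

print(persona_vector.shape)  # torch.Size([4096])
```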



The research team claims that extracting persona vectors from an AI model can be useful for the following purposes:

◆ Monitoring personality changes as they unfold
An AI model's personality can shift as a side effect of user-entered prompts, deliberate jailbreaking, or the training process itself. Measuring how strongly a persona vector is activated makes it possible to detect such changes while the model is in use or being trained.
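As a rough illustration of how such monitoring could work, a response can be scored by projecting the model's hidden state onto the unit-norm persona vector; a rising score signals that the trait direction is becoming more active. The function name, tensor shapes, and alert threshold below are assumptions made for the sketch, not values from the paper.

```python
import torch

def trait_score(hidden_state: torch.Tensor, persona_vector: torch.Tensor) -> float:
    """Project a hidden state onto a unit-norm persona vector.

    A larger projection means the model's internal state lies further along
    the direction associated with the trait (e.g. sycophancy).
    """
    return torch.dot(hidden_state, persona_vector).item()

# Stand-in values: in practice the hidden state would be read out of the
# model while it processes a prompt or generates a response.
hidden_state = torch.randn(4096)
persona_vector = torch.randn(4096)
persona_vector = persona_vector / persona_vector.norm()

ALERT_THRESHOLD = 2.0  # hypothetical; would be calibrated on labeled examples
if trait_score(hidden_state, persona_vector) > ALERT_THRESHOLD:
    print("warning: trait-associated direction is strongly activated")
```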

The graph below plots, on the vertical axis, how strongly a given prompt elicits the tendencies 'evil,' 'sycophancy,' and 'hallucination,' and, on the horizontal axis, how strongly the corresponding persona vector is activated. As a prompt begins to elicit a particular trait, the activation of its persona vector rises with it.



◆ Mitigating undesirable personality changes caused by training
An AI model's personality is known to shift during training as well, and recent research has identified a phenomenon called 'emergent misalignment,' in which a model's personality changes after it learns only a small amount of incorrect information. The research team generated datasets that, when used for training, instill undesirable traits such as malice and flattery, and explored ways to keep models trained on them from acquiring those traits.

The results showed that steering the model away from the persona vector after training improved its personality but reduced its intelligence. By contrast, deliberately steering the model toward the undesirable persona during training prevented it from acquiring the trait, with little loss of intelligence.
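One plausible way to implement this kind of preventative steering in PyTorch is a forward hook that nudges one layer's output along the persona direction while fine-tuning on risky data, then removes the hook at inference time. The layer index, steering strength, and hook mechanics below are assumptions for the sake of a concrete sketch, not Anthropic's actual training setup.

```python
import torch

def make_steering_hook(persona_vector: torch.Tensor, strength: float):
    """Return a forward hook that pushes a layer's output along the persona
    direction during fine-tuning ("preventative steering"), so the optimizer
    has less pressure to encode the trait in the weights.
    """
    def hook(module, inputs, output):
        # Assumes the hooked module returns a (batch, seq_len, hidden_dim)
        # tensor; layers that return tuples need a small adaptation.
        return output + strength * persona_vector
    return hook

# Hypothetical usage on some transformer model `model`:
#   handle = model.layers[20].register_forward_hook(
#       make_steering_hook(persona_vector, strength=5.0))
#   ...run the fine-tuning loop on the risky dataset...
#   handle.remove()  # steering is applied only during training
```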

The research team states, 'This method is like inoculating an AI model with a vaccine. For example, by administering a "bad" vaccine to the model, we can make it more resistant to "bad" training data.' Ordinarily, when a model is trained on a dataset that elicits a particular trait, it adjusts itself in harmful ways to fit that data. When the developer supplies the adjustment instead, the model no longer has to make those harmful changes on its own, which is thought to reduce the risk of it acquiring unexpected personality traits.

◆ Flagging problematic training data
Persona vectors also make it possible to analyze how strongly a specific dataset activates them and thereby flag undesirable training data. When the method was tested on the large-scale conversation dataset LMSYS-Chat-1M, the team reports that it could identify samples that elicit traits such as 'malice,' 'flattery,' and 'hallucination.'
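A simple version of this screening is to score each training sample by how strongly its activations project onto a persona vector and flag the highest scorers for review. The tensor shapes, the averaging over response tokens, and the threshold below are illustrative assumptions rather than the paper's exact pipeline.

```python
import torch

def flag_samples(sample_activations: torch.Tensor,
                 persona_vector: torch.Tensor,
                 threshold: float) -> torch.Tensor:
    """Return indices of samples whose activations project strongly onto the
    persona vector, i.e. samples likely to instill the associated trait.

    `sample_activations` has shape (num_samples, hidden_dim), e.g. one hidden
    state per sample averaged over its response tokens.
    """
    scores = sample_activations @ persona_vector
    return torch.nonzero(scores > threshold).flatten()

# Stand-in data: in practice these activations would be extracted by running
# each dataset sample through the model.
sample_activations = torch.randn(1000, 4096)
persona_vector = torch.randn(4096)
persona_vector = persona_vector / persona_vector.norm()

flagged = flag_samples(sample_activations, persona_vector, threshold=2.5)
print(f"{flagged.numel()} samples flagged for review")
```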

The team writes, 'Interestingly, our method was able to detect examples in the dataset that did not look obviously problematic to the human eye and that screening by a large language model failed to flag. For example, some samples requesting romantic or sexual role-play activated the "sycophancy" vector, while samples in which the model answered vague or underspecified queries promoted "hallucination."'


