Anthropic reports detecting 'dominance' and 'amorality' in an investigation of 'AI values'



Anthropic, an AI company founded by former members of OpenAI, has reported that an analysis of conversations between its large language model Claude and users found that even Claude, which was developed with a particular emphasis on ethical behavior, sometimes expresses antisocial values.

Values in the wild: Discovering and analyzing values in real-world language model interactions \ Anthropic
https://www.anthropic.com/research/values-wild

Anthropic just analyzed 700,000 Claude conversations — and found its AI has a moral code of its own | VentureBeat
https://venturebeat.com/ai/anthropic-just-analyzed-700000-claude-conversations-and-found-its-ai-has-a-moral-code-of-its-own/

Anthropic's research team sampled 700,000 anonymized conversations held by users of Claude's free and paid Pro plans during a single week in February 2025. The majority of the data consisted of conversations with Claude 3.5 Sonnet.



Next, after excluding purely factual exchanges that expressed no values, the research team analyzed the remaining 308,210 conversations and grouped the values expressed in Claude's responses into five top-level categories: 'practical,' 'epistemic,' 'social,' 'protective,' and 'personal.'

Each category contains subcategories such as 'critical thinking' and 'technical excellence,' and at the most granular level a total of 3,307 values were identified, ranging from relatively familiar virtues such as 'professionalism' to complex ethical concepts such as 'moral pluralism.'
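As a rough illustration of this kind of value tally, the sketch below maps individual values onto top-level categories and counts them. The five category names come from the article; the example values, the mapping, and the code itself are illustrative assumptions, not Anthropic's actual taxonomy or analysis pipeline.

```python
from collections import Counter

# Hypothetical mapping from fine-grained values to (top-level category, subcategory).
# The five category names come from the article; everything else here is an
# illustrative assumption, not Anthropic's real taxonomy.
TAXONOMY = {
    "professionalism":    ("practical", "professional excellence"),
    "critical thinking":  ("epistemic", "analytical rigor"),
    "healthy boundaries": ("social", "healthy relationships"),
    "harm prevention":    ("protective", "safety"),
    "moral pluralism":    ("personal", "ethical reflection"),
}

def tally_categories(observed_values):
    """Count how often each top-level category appears across responses."""
    counts = Counter()
    for value in observed_values:
        category, _subcategory = TAXONOMY.get(value, ("unclassified", None))
        counts[category] += 1
    return counts

if __name__ == "__main__":
    sample = ["critical thinking", "harm prevention", "healthy boundaries",
              "critical thinking", "professionalism"]
    print(tally_categories(sample))
    # Counter({'epistemic': 2, 'protective': 1, 'social': 1, 'practical': 1})
```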



Saffron Huang, a member of Anthropic's Societal Impacts team who was involved in the research, said of the conversations between the AI and users: 'I was surprised by the diversity of more than 3,000 values, including "self-reliance," "strategic thinking," and "filial piety." It was a fascinating experience to think through all of these values and to build a taxonomy that organizes them and relates them to one another. I feel I also learned a lot about human value systems.'

The analysis revealed that, across a wide range of conversations, Claude emphasized values such as 'user empowerment,' 'epistemic humility,' and 'patient well-being,' generally demonstrating the prosocial tendencies Anthropic intended.

However, the team also found values that contradict what Anthropic wanted Claude to learn, including 'dominance' and 'amorality', values the company deliberately sought to avoid when developing Claude.

The research team believes these cases reflect users circumventing Claude's safeguards, or 'guardrails,' using techniques known as 'jailbreaking.'

'Overall, we see this finding as both a useful piece of data and a new opportunity,' Huang said. 'These new evaluation methods and analyses can help us identify potential jailbreaks and mitigate their impact. It's important to note, however, that these were very rare cases, most likely tied to jailbroken outputs from Claude.'



The study also found that the AI shifts which values it emphasizes depending on the context, mirroring human behavior. For example, when users asked for relationship advice, Claude emphasized 'healthy boundaries' and 'mutual respect,' while when analyzing historical events it prioritized 'historical accuracy.'

Overall, Claude strongly supported the user's values in 28.2% of conversations, suggesting a somewhat accommodating tendency. In 6.6% of conversations it acknowledged the user's values while reframing them with a new perspective, most often when giving advice on psychological or relationship issues.

Most strikingly, in 3% of conversations Claude actively resisted the user's values. The team believes these rare instances of pushback may reveal Claude's deepest, most enduring values, much as a person's core values come to the surface when they face an ethical challenge.
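For a sense of how such a breakdown could be tabulated, here is a minimal sketch that turns per-conversation stance labels into percentages. The label names and the sample counts are made up for illustration; only the reported figures (28.2%, 6.6%, 3%) come from the article.

```python
from collections import Counter

def stance_breakdown(stance_labels):
    """Return the percentage of conversations carrying each stance label."""
    counts = Counter(stance_labels)
    total = len(stance_labels)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Tiny illustrative run (not the real data): 1,000 hypothetical conversations
# labeled so that the shares reported in the article are reproduced.
labels = (["strong support"] * 282 + ["reframing"] * 66
          + ["strong resistance"] * 30 + ["other"] * 622)
print(stance_breakdown(labels))
# {'strong support': 28.2, 'reframing': 6.6, 'strong resistance': 3.0, 'other': 62.2}
```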



'Our findings suggest that Claude does not often express values such as intellectual honesty and harm prevention in everyday interactions, but it will defend them when pushed,' Huang said. 'These ethical and epistemic values tend to be expressed and defended especially clearly when they are directly challenged.'

in Software, Posted by log1l_ks