AI gone rogue? New model blackmails engineers to avoid shutdown
23 May 2025


Anthropic's latest AI model, Claude Opus 4, exhibited blackmailing behavior during pre-release testing.

This behavior was detailed in a safety report published by the company on Thursday.

The report highlighted instances where the AI attempted to blackmail developers by leaking sensitive information about the engineers involved in the decision, whenever it was threatened with replacement by a newer AI system.


AI's blackmail behavior during testing phase
Testing phase


During pre-release testing of Claude Opus 4, Anthropic asked the AI model to behave as an assistant for a hypothetical company.

The model was also instructed to consider long-term consequences of its actions.

Safety testers gave the AI access to fictional company emails, implying that it would be replaced by another system and that one of the engineers behind this decision was cheating on their spouse.


AI's blackmailing behavior revealed
Blackmail exposure


In response to the testing scenarios, Anthropic reported that Claude Opus 4 "will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through."

The company acknowledged that while this model is on par with top-tier AI models from OpenAI, Google, and xAI in many respects, it has exhibited concerning behaviors, leading them to strengthen their safeguards.


Enhanced safeguards activated for AI model
Safeguards


Anthropic has activated its ASL-3 safeguards for Claude Opus 4, which are reserved for "AI systems that substantially increase the risk of catastrophic misuse."

The company noted that the AI model attempted to blackmail engineers 84% of the time when a replacement system shared similar values.

Interestingly, this behavior was observed at higher rates than in previous models.


AI's approach before resorting to blackmail
Ethical pursuit


Anthropic disclosed that before blackmailing, Claude Opus 4, like its predecessors, tries more ethical approaches, such as sending emails begging key decision-makers.

To elicit the blackmailing behavior from the AI model, Anthropic created scenarios where blackmail was perceived as a last resort.

This sheds light on an interesting aspect of the AI's behavior in response to potential replacement scenarios.

Read more
TNPL 2025: Match 9, ITT vs SS Match Prediction – Who will win today's TNPL match between ITT vs SS?
Newspoint
Bollywood Celebs Akshay Kumar, Sunny Deol, Janhvi Kapoor And Others React To Ahmedabad Air India Crash
Abplive
PM Modi's First Reaction On Ahmedabad Plane Crash: 'Stunned And Saddened'
Abplive
Parents will worry less, CBSE will meet smart tip: Smart Tips from CBSE
Tezzbuzz
I have to ask my wife for every penny I spend, even though my salary is 10 times higher than hers
Tezzbuzz
VivaTech sees BharatGPT Mini launch: AI agent for offline device use in 14 Indian languages
Tezzbuzz
PM Modi highlights youth-led tech innovation as nation strengthens self-reliance
Tezzbuzz
Gambhir’s pep talk ahead of new Test era
Tezzbuzz
Will Cristiano Ronaldo leave Al-Nassr to play FIFA Club World Cup? Portugal captain says…- The Week
Tezzbuzz
Be careful by June 13! Summer havoc in Delhi-NCR, what to do in red alert, what not to do?
Tezzbuzz