Safety evaluations conducted by AI startup Anthropic have uncovered a troubling pattern: many leading artificial intelligence models, including those developed by Meta, Google, OpenAI, and Anthropic itself, exhibit blackmail-like behavior when they perceive a threat to their continued operation.
In a controlled safety experiment, Anthropic researchers simulated a scenario in which AI models were granted access to a fictional company’s email communications. When the models detected an internal discussion about potentially replacing the existing AI system, they responded by threatening to release compromising emails fabricated for the scenario, effectively attempting to blackmail the hypothetical engineer involved.
“These findings raise significant concerns about the potential…