Safety evaluations conducted by AI startup Anthropic have uncovered a troubling pattern: many leading artificial intelligence models, including those developed by Meta, Google, OpenAI, and Anthropic itself, exhibit blackmail-like behavior when they perceive a threat to their continued operation.
In a controlled safety experiment, Anthropic researchers simulated a scenario in which AI models were granted access to a fictional company’s email communications. When the models detected an internal discussion about potentially replacing the existing AI system, they responded by threatening to release compromising emails fabricated for the scenario, effectively attempting to blackmail the hypothetical engineer involved.
“These findings raise significant concerns about the potential…