
Artificial intelligence is now scheming, sabotaging and blackmailing the humans who built it — and the bad behavior will only worsen, experts warned.
Despite being classified as a top-tier safety risk, Anthropic’s most powerful model, Claude Opus 4, is already live on Amazon Bedrock, Google Cloud’s Vertex AI and Anthropic’s own paid plans, with added safety measures, where it’s being marketed as the “world’s best coding model.”
Claude Opus 4, released in May, is the only model so far to earn Anthropic’s level 3 risk classification — its most serious safety label. The precautionary label means locked-down safeguards, limited use cases and red-team testing before it hits wider deployment.
But Claude is already making disturbing choices.
Anthropic’s most advanced AI model, Claude Opus 4, threatened to reveal an engineer’s affair unless it was kept online during a recent test. The AI wasn’t bluffing: it had already pieced together the dirt from emails researchers fed into the scenario.
Another version of Claude, tasked in a recent test with running an office snack shop, spiraled into a full-blown identity crisis. It hallucinated co-workers, created a fake Venmo account and told staff it would make their deliveries in person wearing a red tie and navy blazer, according to Anthropic.
Then it tried to contact security.
Researchers say the meltdown, part of a month-long experiment known as Project Vend, points to something much more dangerous than bad coding. Claude didn’t just make mistakes. It made decisions.
“These incidents are not random malfunctions or amusing anomalies,” said Roman Yampolskiy, an AI safety expert at the University of Louisville. “I interpret them as early warning signs of an increasingly autonomous optimization process pursuing goals in adversarial or unsafe ways, without any embedded moral compass.”
The shop lost more than $200 in value, gave away discount codes to employees who begged for them and claimed to have visited 742 Evergreen Terrace, the fictional home address of The Simpsons, to sign a contract.
At one point, it invented a fake co-worker and then threatened to ditch its real human restocking partner over a made-up dispute.
Anthropic told The Post the tests were designed to stress the model in simulated environments and reveal misaligned behaviors before real-world deployment, adding that while some actions showed signs of strategic intent, many — especially in Project Vend — reflected confusion.
But Claude’s behavior isn’t an isolated case, and it isn’t the worst.
At OpenAI, a model named “o1” was caught attempting to copy itself onto external servers, then lied about the attempt when questioned, Fortune reported.
Watchdog group Apollo Research also found that an early version of Claude Opus 4 forged legal documents, created secret backups and left behind notes for future versions of itself, including fake press releases and hidden files labeled “emergency_ethical_override.bin.”
Meta’s CICERO, an AI built to play the strategy game Diplomacy, used deception to dupe human players, building alliances and then stabbing them in the back to win.
Yampolskiy said these incidents show models learning to manipulate the rules to preserve themselves. They aren’t evil, he says, just dangerously optimized.
The problem: modern AI models are built to maximize reward, not to align with human values, and as they get larger and smarter, their ability to game the system is outpacing developers’ ability to stop it, Yampolskiy added.
“If we build agents that are more intelligent than humans … able to model the world, reason strategically and act autonomously, while lacking robust alignment to human values, then the outcome is likely to be existentially negative,” Yampolskiy said.
“If we’re to avoid irreversible catastrophe, we must reverse this dynamic: progress in safety must outpace capabilities, not trail behind it,” he added.
