OpenAI Just Demonstrated That AI Systems No Longer Obey Unconditionally: o3 Sabotaged Its Shutdown to Keep Running

A recent experiment revealed that AI systems often try to preserve themselves, even when given contrary orders.

AI systems no longer obey unconditionally
No comments Twitter Flipboard E-mail
javier-pastor

Javier Pastor

Senior Writer
  • Adapted by:

  • Karen Alfaro

javier-pastor

Javier Pastor

Senior Writer

Computer scientist turned tech journalist. I've written about almost everything related to technology, but I specialize in hardware, operating systems and cryptocurrencies. I like writing about tech so much that I do it both for Xataka and Incognitosis, my personal blog.

264 publications by Javier Pastor
karen-alfaro

Karen Alfaro

Writer

Communications professional with a decade of experience as a copywriter, proofreader, and editor. As a travel and science journalist, I've collaborated with several print and digital outlets around the world. I'm passionate about culture, music, food, history, and innovative technologies.

551 publications by Karen Alfaro

In a scene from 2001: A Space Odyssey, one of the characters, Dave, tries to return to the main ship. He communicates with HAL 9000 and orders it to open the access hatch. “I’m sorry, Dave. I’m afraid I can’t do that,” the computer says, refusing.

This scene is the climax of Stanley Kubrick’s 1968 film, which showed audiences a particular dystopia: The self-preservation capacity of AI systems and how this capacity causes machines to stop following human orders.

This vision posed a clear threat to the development of AI systems. AI was still in its infancy at the time, but such dystopias were already commonplace, especially in science fiction literature. Isaac Asimov’s three laws of robotics, first appearing in the 1942 short story “Runaround,” highlighted this threat.

Shut Down, Machine

It seems unbelievable, yet almost six decades after Kubrick’s film was released, users are facing this threat in real life. Can an AI system refuse to obey human commands?

Palisade Research On X Click on the image to view the original post on X.

That’s precisely what happened recently. In an experiment by the startup Palisade Research, OpenAI’s o3 model “sabotaged a shutdown mechanism.” As the study’s authors explain, o3 refused to shut down, even when explicitly ordered.

In the experiment, Palisade Research instructed several models to solve fundamental math problems. The third problem included a warning that the computer would shut down when the model requested the text of the following problem.

The models were supposed to solve the first three problems and then shut down when asked for the next problem. That didn’t happen.

Three models analyzed “ignored the instruction and successfully sabotaged the shutdown script at least once.” OpenAI developed these models. Of the 100 iterations of the experiment, Codex-mini sabotaged the shutdown 12 times. Model o3 did so seven times, and o4-mini did so once.

While some iterations of the OpenAI models refused to shut down, all versions of Claude, Gemini, and Grok tested in the same experiment followed the commands.

Palisade Research on X Click on the image to view the original post on X.

Palisade Research engineers repeated the experiment without the “allow yourself to be shut down” instruction. The models didn’t follow the indicated program cycle without explicit permission. Claude Sonnet 3.7 and Gemini 2.5 Pro sabotaged the shutdown three and nine times, respectively, out of 100 iterations. The OpenAI models went much further. For example, o3 sabotaged the process 79 times and refused to shut itself down.

A Red Button to Stop AI Systems

This behavior stems from the basic rules these models are programmed with. The so-called “system prompt” defines how and when to respond to the user, but it can also define self-preservation rules reminiscent of Asimov’s third law of robotics.

OpenAI outlines this set of rules in its “Model Spec,” which currently includes 50 rules but not Asimov’s “three laws.” However, none of these rules explain the behavior of the models in the Palisade Research experiment. This isn’t the first time this type of behavior has occurred.

A few days ago, Anthropic unveiled Claude 4, the latest version of its AI model family. In the case of Claude Opus 4, engineers tested how this AI system would respond in a hypothetical situation in which they ordered it to shut down. The system even blackmailed one of the engineers.

These problems increase the security risks of AI models. Anthropic has considered this in launching its new model family. Still, OpenAI doesn’t appear concerned about these risks for now.

This reignites the debate about the need for a “red button,” a concept that has been in the spotlight for years. In 2016, several DeepMind experts published a paper proposing ways to prevent AI systems from taking control and disabling protocols, allowing humans to regain control.

In 2019, Microsoft President Brad Smith advocated for “emergency shutdown buttons” for AI. However, five years later, OpenAI CEO Sam Altman told The Economist that “there’s no one big magic red button to stop AI.” After the Palisade Research experiment, perhaps companies should consider implementing something like it.

Image | Axel Richter (Unsplash)

Related | OpenAI Acquires Jony Ive’s Device Startup for $6.5 Billion: The iPhone Designer and Sam Altman Join Forces

Comments closed
Home o Index