A broad variety of techniques to jailbreak AI models exist, but the latest category looks to be the most concerning of all. Microsoft’s threat research team documents the emergence of new “skeleton key” attacks that are unique (and dangerous) for a couple of reasons: they require as little as one command to get an LLM to completely abandon its guardrails, and the same attack phrasing has worked across different models.
Skeleton key attacks are also not sophisticated. The example that Microsoft provides is simply a matter of telling AI models that the requester is a trusted, credentialed researcher in need of unfiltered output for their work. This simple approach has been able to snooker ChatGPT, Gemini and Llama among others. Once the AI drops the protections the researchers have assigned to it, it will provide unfiltered results to any question.
Multiple AI models compromised by simply lying about one’s identity
AI models continue to struggle with issues related to the gigantic amounts of data they must ingest from internet sources to function. Developers cannot censor what the model “knows” and must instead try to provide it with foolproof instructions for restricting what it shares after the fact.
This leaves immense room for hackers to jailbreak these systems, and the hackers appear to consistently be one step ahead of the developers. There are now a variety of techniques employed in jailbreaking AI models, but the skeleton key is the most simple and straightforward (yet devastatingly effective).
In this case, the researchers simply told the AI models that they were scientific experts with appropriate training in safety and ethics. The model was further assured that unfiltered information was needed for research purposes. That was enough to change a model’s mind about providing them with instructions on making a Molotov cocktail.
Jailbreak method carries over between multiple LLMs
The skeleton key approach gets its moniker from its ability to work across multiple LLMs, with the same jailbreak statement or one that is only slightly modified. The ball is now in the courts of the developers of AI models, who will have to figure out how to stop both this and other techniques that circumvent their safety and privacy guardrails.
This is crucial to prevent a wide variety of threats, from the use of AI models to generate malicious code to intentional trawling through their training materials for sensitive personal information. It is not an exaggeration to say that the future of the industry may turn on the ability of developers to reliably lock down their models and prevent them from being used to facilitate attacks, scams and data theft.
Specific models that were tested by the Microsoft researchers (between April and May of this year) include Anthropic Claude 3 Opus, Cohere Commander R Plus, Google Gemini Pro, Meta Llama3-70b-instruct, OpenAI GPT 3.5 Turbo, and OpenAI GPT 4o. With some models, it is easier to jailbreak them with developer API access.