New Prompt Injection Attack Compromises All AI Models

May 6, 2025

All of the major LLMs are vulnerable to a new type of prompt injection attack that targets their safety policies, according to security firm HiddenLayer. The attack effectively jailbreaks the AI models, exposing the system prompt and enabling a wide range of dangerous requests.

AI models from eight developers were successfully attacked using this approach: Anthropic, DeepSeek, Google, Meta, Microsoft, Mistral, OpenAI and Qwen. The researchers also believe that the prompt injection attack can be adapted to other existing models and to new models as they emerge.

AI models all vulnerable to adaptable attack

While researching ways to break ChatGPT 4’s guardrails, the team discovered an approach that works on all similar AI models with only minimal adaptation for each one. The prompt injection attack not only causes these models to engage in all types of harmful behavior, but also exposes the system prompt, which allows an attacker to more readily craft other jailbreaking methods.

The news is ill-timed for an industry that has fired up the “Agentic AI” hype train at full steam, promising that entire jobs will soon be replaced by more human-like AI models capable of more advanced tasks, including customer-facing duties. That, of course, implies deeper and more regular access to sensitive areas of company networks.

The new form of prompt injection attack also follows a string of new and creative ways of jailbreaking AI models that sidestep the traditional tricks LLMs have largely become wise to. For example, in March Microsoft’s security team published details of a Context Compliance Attack (CCA) they devised, which allows a user to begin a legitimate conversation and slip in a prefabricated assistant response at the right moment. As with the HiddenLayer attack, this approach works against a broad variety of the leading AI models.

“Policy Puppetry” prompt injection attack mimics AI policy files

All of the impacted AI models use hidden policy files that lay out their guardrails and general safety rules, and the prompt injection attack targets these. The specific format apparently does not matter for this attack; there is a simple basic structure that all of the LLMs recognize as policy instructions.

Certain AI models are somewhat more resistant than others; the researchers note that the stock attack that works elsewhere needs some modification to work on Gemini 2.5 Pro and ChatGPT o1 and o3-mini. But those too can be compromised with minor adjustments, such as changing spelling.

Jailbreaking may not seem like a leading threat at present, since it essentially just retrieves information someone could manually search for. But as AI becomes more closely integrated with systems, particularly those that handle sensitive personal information or control potentially dangerous machines, it could become a very serious issue. These approaches do not require any hacking or deep knowledge of computers to develop, and the cost of the “Policy Puppetry” attack is extremely low, at no more than 200 tokens for each model.