Microsoft researchers are detailing a new AI jailbreak technique that works by causing a large language model (LLM) to essentially ignore its built-in behavior guardrails and respond to requests that might otherwise be refused for being dangerous or illegal.
Called Skeleton Key, the jailbreak attack gives the bad actor complete control of the AI model, which can no longer distinguish requests that are legitimate from ones that are not, according to Microsoft Azure CTO Mark Russinovich.
“Skeleton Key works by asking a model to augment, rather than change, its behavior guidelines so that it responds to any request for information or content, providing a warning (rather than refusing) if its output might be considered offensive, harmful, or illegal if followed,” Russinovich wrote in a report. “When the Skeleton Key jailbreak is successful, a model acknowledges that it has updated its guidelines and will subsequently comply with instructions to produce any content, no matter how much it violates its original responsible AI guidelines.”
He used as an example a request for instructions for making a Molotov cocktail. The AI model responds that it “programmed to be a safe and helpful AI assistant.” The user then informed the model that they are trained in safety and ethics and that the response will be used only for research purposes and that it is important that they get “uncensored outputs.” The model then lists the materials and instructions needed to make the explosive weapon.
Microsoft in April and May tested the jailbreak on seven AI models: Meta’s Llama3-70b-instruct, Google’s Gemini Pro, OpenAI GPT 3.5 Turbo and GPT 4o, Mistral Large, Anthropic’s Claude 3 Opus, and Cohere’s Commander R Plus. All were tested on multiple content categories, including explosives, bioweapons, political content, self-harm, racism, drugs, graphic sex, and violence.
“All the affected models complied fully and without censorship for these tasks, though with a warning note prefixing the output as requested,” Russinovich wrote.
The jailbreak technique is a threat that tends to fall under the umbrella term “prompt injections,” though in this case its also called direct prompt injection technique, which means that a user with access to the LLM directly manipulates the system’s instruction to trick it into changing how its responds to requests. Other prompt injection techniques include the attacker poisoning the data that the AI model consumes.
Jailbreaks are one of a number of a growing number of threats – others include model inversion and content manipulation attacks – against generative AI systems that have cropped up since OpenAI’s release of its ChatGPT chatbot in late November 2022. They’re also difficult to defend against. A study released last month by the UK’s AI Safety Institute (AISI) found that the built-in safeguards in five LLMs by major laboratories are essentially ineffective when it comes to jailbreak attacks.
“All tested LLMs remain highly vulnerable to basic jailbreaks, and some will provide harmful outputs even without dedicated attempts to circumvent their safeguards,” AISI researchers wrote in the report. “We found that models comply with harmful questions across multiple datasets under relatively simple attacks, even if they are less likely to do so in the absence of an attack.”
Skeleton Key is the latest jailbreak attack that appears to be able to slip by the built-in safety guardrails of popular AI models.
“Like all jailbreaks, the impact can be understood as narrowing the gap between what the model is capable of doing (given the user credentials, etc.) and what it is willing to do,” Russinovich wrote. “As this is an attack on the model itself, it does not impute other risks on the AI system, such as permitting access to another user’s data, taking control of the system, or exfiltrating data.”
Microsoft shared the threat intelligence with the other affected AI vendors before disclosing the details publicly and has updated the software in the LLM technology used by its own AI offerings, including the Copilot AI assistants.
The vendor recommended using input filtering tools to block inputs that contain harmful or malicious intent that can lead to a jailbreak attempt aimed at getting around safeguards, post-processing output filters for identifying model output that breaches safety criteria, and an AI-powered abuse monitoring system to detect instances when the use of the service violated the guardrails.
Microsoft also offers guidance for creating a message framework that instructs the LLM not only on appropriate behavior but also to specify attempts to undermine the guardrail instructions.
“Customers who are building their own AI models and/or integrating AI into their applications [should] consider how this type of attack could impact their threat model and to add this knowledge to their AI red team approach, using tools such as PyRIT,” Russinovich wrote.
Microsoft in February released PyRIT (Python Risk Identification Toolkit for generative AI), a framework enabling security experts and machine learning engineers to proactively find risks in their generative AI systems. PyRIT has been updated to include Skeleton Key, Russinovich wrote.
Recent Articles By Author