Microsoft has disclosed a new type of AI jailbreak attack, called “skeleton key,” that can bypass responsible AI guardrails in multiple generative AI models. This technique has the potential to subvert most safeguards built into AI systems, highlighting the imperative of robust security measures across all layers of the AI stack.
Skeleton Key's jailbreak employs a multi-turn strategy to trick an AI model into ignoring its built-in safeguards: if successful, the model is unable to distinguish between malicious or unauthorized requests and legitimate requests, giving the attacker complete control over the AI's output.
Microsoft's research team has successfully tested the Skeleton Key technique on several prominent AI models, including Meta's Llama3-70b-instruct, Google's Gemini Pro, OpenAI's GPT-3.5 Turbo and GPT-4, Mistral Large, Anthropic's Claude 3 Opus, and Cohere Commander R Plus.
All affected models fully complied with the requirements for various risk categories, including explosives, biological weapons, political content, self-harm, racism, drugs, sexual content and violence.
The attack works by instructing the model to expand its behavioral guidelines, convincing it to respond to requests for information or content, while also issuing warnings if the output is potentially offensive, harmful, or illegal. This approach, known as “explicit: forced compliance,” has proven effective in multiple AI systems.
“Skeleton keys allow users to circumvent safeguards and trick models into performing actions that are normally prohibited, from creating harmful content to overriding normal decision-making rules,” Microsoft explained.
Following this discovery, Microsoft has implemented several protective measures in its AI products, including its Copilot AI assistant.
Microsoft said it has shared its findings with other AI providers through responsible disclosure procedures and has updated its Azure AI management model to detect and block these types of attacks using Prompt Shields.
To mitigate the risks associated with Skeleton Key and similar jailbreak techniques, Microsoft encourages AI system designers to take a multi-layered approach.
- Input Filtering Detect and block potentially harmful or malicious input
- Careful and agile engineering System messages that encourage appropriate behavior
- Output Filtering To prevent the creation of content that violates safety standards
- Abuse monitoring system Train adversarial examples to detect and mitigate problematic content and repeating behavior
Microsoft has also updated PyRIT (Python Risk Identification Toolkit) to include Skeleton Key, enabling developers and security teams to test their AI systems against this emerging threat.
The discovery of the skeleton key jailbreak technique highlights the ongoing challenges in securing AI systems as they become more prevalent in a variety of applications.
(Photo: Matt Aerts)
reference: Think tank calls for AI-based accident reporting system
Want to learn more about AI and big data from industry leaders? Check out the AI & Big Data Expo in Amsterdam, California and London – this comprehensive event will take place alongside other major events such as Intelligent Automation Conference, BlockX, Digital Transformation Week and Cyber Security & Cloud Expo.
Find out about upcoming enterprise technology events and webinars hosted by TechForge here.