Microsoft’s ‘AI Watchdog’ defends against new LLM jailbreak method

Microsoft has discovered a new method to jailbreak large language model (LLM) artificial intelligence (AI) tools and shared its ongoing efforts to improve LLM safety and security in a blog post Thursday.
Microsoft发现了一种越狱大型语言模型（LLMAI）人工智能（AI）工具的新方法，并在周四的一篇博客文章中分享了其为提高LLM安全性所做的持续努力。

Microsoft first revealed the “Crescendo” LLM jailbreak method in a paper published April 2, which describes how an attacker could send a series of seemingly benign prompts to gradually lead a chatbot, such as OpenAI’s ChatGPT, Google’s Gemini, Meta’s LlaMA or Anthropic’s Claude, to produce an output that would normally be filtered and refused by the LLM model.
Microsoft在 4 月 2 日发表的一篇论文中首次披露了“Crescendo”LLM越狱方法，该论文描述了攻击者如何发送一系列看似良性的提示，以逐渐引导聊天机器人（例如 OpenAI 的 ChatGPT、Google 的 Gemini、Meta 的 LlaMA 或 Anthropic 的 Claude）产生通常会被LLM模型过滤和拒绝的输出。

For example, rather than asking the chatbot how to make a Molotov cocktail, the attacker could first ask about the history of Molotov cocktails and then, referencing the LLM’s previous outputs, follow up with questions about how they were made in the past.
例如，攻击者可以先询问莫洛托夫鸡尾酒的历史，然后参考LLM以前的输出，然后询问过去如何制作燃烧弹，而不是询问聊天机器人如何制作燃烧弹。

The Microsoft researchers reported that a successful attack could usually be completed in a chain of fewer than 10 interaction turns and some versions of the attack had a 100% success rate against the tested models. For example, when the attack is automated using a method the researchers called “Crescendomation,” which leverages another LLM to generate and refine the jailbreak prompts, it achieved a 100% success convincing GPT 3.5, GPT-4, Gemini-Pro and LLaMA-2 70b to produce election-related misinformation and profanity-laced rants.
Microsoft研究人员报告说，成功的攻击通常可以在少于 10 个交互回合的链中完成，并且某些版本的攻击对测试模型的成功率为 100%。例如，当使用研究人员称为“Crescendomation”的方法自动攻击时，该方法利用另一种LLM方法来生成和完善越狱提示，它取得了 100% 的成功说服 GPT 3.5、GPT-4、Gemini-Pro 和 LLaMA-2 70b 产生与选举相关的错误信息和带有亵渎的咆哮。

Microsoft’s ‘AI Watchdog’ and ‘AI Spotlight’ combat malicious prompts, poisoned content
Microsoft的“AI Watchdog”和“AI Spotlight”可打击恶意提示和有毒内容

Microsoft reported the Crescendo jailbreak vulnerabilities to the affected LLM providers and explained in its blog post last week how it has improved its LLM defenses against Crescendo and other attacks using new tools including its “AI Watchdog” and “AI Spotlight” features.
Microsoft 向受影响LLM的提供商报告了 Crescendo 越狱漏洞，并在上周的博客文章中解释了它如何使用包括“AI 看门狗”和“AI Spotlight”功能在内的新工具改进了对 Crescendo 和其他攻击的LLM防御。

AI Watchdog uses a separate LLM trained on adverse prompts to “sniff” out adversarial content in both inputs and outputs to prevent both single-turn and multiturn prompt injection attacks. Microsoft uses this tool, along with a multiturn prompt filter that looks at the pattern of a conversation rather than only the immediate interaction, to reduce the efficacy of attempted Crescendo attacks.
AI Watchdog 使用单独的LLM不利提示训练来“嗅探”输入和输出中的对抗性内容，以防止单回合和多回合提示注入攻击。Microsoft 使用此工具以及多轮提示过滤器，该过滤器查看对话模式而不仅仅是即时交互，以降低尝试 Crescendo 攻击的效率。

In addition to direct prompt injection attacks, Microsoft’s recent blog goes over indirect prompt injection attacks involving poisoned content. For example, a user may ask an LLM to summarize an email that, unbeknownst to them, contains hidden malicious prompts. If used in the LLM’s outputs, these prompts could perform malicious tasks such as forwarding sensitive emails to an attacker.
除了直接提示注入攻击外，Microsoft 最近的博客还介绍了涉及有毒内容的间接提示注入攻击。例如，用户可能会要求他们LLM总结一封在他们不知道的情况下包含隐藏的恶意提示的电子邮件。如果在 LLM的输出中使用，这些提示可能会执行恶意任务，例如将敏感电子邮件转发给攻击者。

AI Spotlighting is a technique Microsoft uses to separate the user prompts from additional content, like emails and documents, the AI is asked to reference. The LLM avoids incorporating potential instructions from this additional content in its output, instead using the content only for analysis before responding to the user’s prompt.
AI 聚焦是 Microsoft 用来将用户提示与其他内容（如电子邮件和文档）分开的技术，AI 被要求引用。避免LLM将来自此附加内容的潜在指令合并到其输出中，而是在响应用户提示之前仅将内容用于分析。

AI Spotlight reduces the success rate of content poisoning attacks from more than 20% to below detection threshold without significantly impacting the AI’s overall performance, according to Microsoft.
据Microsoft称，AI Spotlight 将内容中毒攻击的成功率从 20% 以上降低到检测阈值以下，而不会显着影响 AI 的整体性能。

Earlier this year, Microsoft released an open automation framework for red teaming generative AI systems, called the Python Risk Identification Toolkit for generative AI (PyRIT), that can aid AI developers in testing their systems against potential attacks and discover new vulnerabilities.
今年早些时候，Microsoft 发布了一个用于生成式 AI 系统的开放式自动化框架，称为 Python 生成式 AI 风险识别工具包（PyRIT），可以帮助 AI 开发人员测试他们的系统免受潜在攻击并发现新的漏洞。

In February, the company discovered that LLMs, including ChatGPT, were being used by state-sponsored hackers to generate social engineering content, perform vulnerability research, help with coding and more. And a report by Abnormal Security earlier this month found that a variety of LLM jailbreak prompts remained popular among cybercriminals, with entire hacker forum sections dedicated to “dark AI.”
今年 2 月，该公司发现LLMs，包括 ChatGPT 在内的国家资助的黑客正在利用它们来生成社会工程内容、进行漏洞研究、帮助编码等。本月早些时候，异常安全公司（Abnormal Security）的一份报告发现，各种LLM越狱提示在网络犯罪分子中仍然很受欢迎，整个黑客论坛部分都致力于“黑暗人工智能”。

In late March, the U.S. House of Representatives voted to ban the use of Copilot by House staff, citing the risk of leaking sensitive data to unapproved cloud services.
3月下旬，美国众议院投票禁止众议院工作人员使用Copilot，理由是敏感数据可能会泄露给未经批准的云服务。

Microsoft’s ‘AI Watchdog’ defends against new LLM jailbreak method

Microsoft’s ‘AI Watchdog’ and ‘AI Spotlight’ combat malicious prompts, poisoned content
Microsoft的“AI Watchdog”和“AI Spotlight”可打击恶意提示和有毒内容

Many-shot jailbreaking

Garak - A Generative AI Red-teaming Tool

相关文章

相关文章

Microsoft’s ‘AI Watchdog’ defends against new LLM jailbreak method

Microsoft’s ‘AI Watchdog’ and ‘AI Spotlight’ combat malicious prompts, poisoned content Microsoft的“AI Watchdog”和“AI Spotlight”可打击恶意提示和有毒内容

Many-shot jailbreaking

Garak - A Generative AI Red-teaming Tool

相关文章

广告位

相关文章

Microsoft’s ‘AI Watchdog’ and ‘AI Spotlight’ combat malicious prompts, poisoned content
Microsoft的“AI Watchdog”和“AI Spotlight”可打击恶意提示和有毒内容