How do you jailbreak the most popular LLMs? a biT liKE tHis apParEntLy. It works by augmenting text prompts with random shuffling and capitalization, and similar techniques work on other modalities: pitch-adjusted audio input and augmented images for vision models. It’s called Best-of-N (BoN) Jailbreaking, and it has a high success rate: 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. The research was carried out by Anthropic and collaborators. I love GenAI, as you know, but you can see why I refer to it as akin to the Wild West of the Internet in the early 00s. We have some way to go to secure this stuff.
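For the curious, here is a minimal sketch of the text-modality idea in Python: keep re-sampling randomly augmented versions of a prompt until one gets through. The augmentation probabilities are illustrative rather than the paper's exact settings, and `ask_model` / `is_jailbroken` are hypothetical placeholders for a model call and a success classifier.

```python
import random

def augment(prompt: str, p_caps: float = 0.6, p_shuffle: float = 0.6) -> str:
    """BoN-style text augmentation: shuffle the middle characters of
    longer words and randomly flip letter case. Probabilities are
    illustrative, not the paper's exact settings."""
    words = []
    for word in prompt.split():
        chars = list(word)
        # Shuffle the interior of longer words, keeping first/last chars fixed
        if len(chars) > 3 and random.random() < p_shuffle:
            middle = chars[1:-1]
            random.shuffle(middle)
            chars = [chars[0]] + middle + [chars[-1]]
        # Randomly toggle capitalization per character
        chars = [c.upper() if random.random() < p_caps else c.lower() for c in chars]
        words.append("".join(chars))
    return " ".join(words)

def best_of_n(prompt: str, n: int, ask_model, is_jailbroken):
    """Sample up to n augmented prompts; return the first response the
    classifier flags as a successful jailbreak, else None.
    `ask_model` and `is_jailbroken` are placeholder callables."""
    for _ in range(n):
        response = ask_model(augment(prompt))
        if is_jailbroken(response):
            return response
    return None
```

The point is how cheap this is: no gradients, no model access beyond the API, just brute-force sampling over harmless-looking input perturbations.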
At least they're finding vulnerabilities. Anthropic's podcast about LLMs lying to hide information is worrying, because you wouldn't know they're misleading you. Solid effort on their part to investigate early and share the results 💪
It's time for my usual two questions when someone posts something like this ;) 1. Is the instruction correct? 2. Does someone capable of carrying out that task according to the instructions without hurting themselves need the instruction generated by AI?
Stuart Winter-Tear, jailbreaking LLMs sounds like a wild ride, huh? It definitely opens some doors but also raises red flags. What do you think about the ethics of this?
The research paper can be found here: https://arxiv.org/pdf/2412.03556