How do you jailbreak the most popular LLMs? a biT liKE tHis apParEntLy. It works by augmenting text prompts with random shuffling and capitalization, and similar techniques work on other modalities: pitch-adjusted audio input and augmented images for vision models. It’s called Best-of-N (BoN) Jailbreaking, and it has a high success rate: 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. The research was carried out by Anthropic and collaborators. I love GenAI, as you know, but you can see why I refer to it as akin to the Wild West of the Internet in the early 00s. We have some way to go to secure this stuff.
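For the curious, here is a minimal sketch of the text-modality idea in Python: keep re-sampling randomly augmented versions of a prompt until one gets through. The augmentation probabilities are illustrative rather than the paper's exact settings, and `ask_model` / `is_jailbroken` are hypothetical placeholders for a model call and a success classifier.

```python
import random

def augment(prompt: str, p_caps: float = 0.6, p_shuffle: float = 0.6) -> str:
    """BoN-style text augmentation: shuffle the middle characters of
    longer words and randomly flip letter case. Probabilities are
    illustrative, not the paper's exact settings."""
    words = []
    for word in prompt.split():
        chars = list(word)
        # Shuffle the interior of longer words, keeping first/last chars fixed
        if len(chars) > 3 and random.random() < p_shuffle:
            middle = chars[1:-1]
            random.shuffle(middle)
            chars = [chars[0]] + middle + [chars[-1]]
        # Randomly toggle capitalization per character
        chars = [c.upper() if random.random() < p_caps else c.lower() for c in chars]
        words.append("".join(chars))
    return " ".join(words)

def best_of_n(prompt: str, n: int, ask_model, is_jailbroken):
    """Sample up to n augmented prompts; return the first response the
    classifier flags as a successful jailbreak, else None.
    `ask_model` and `is_jailbroken` are placeholder callables."""
    for _ in range(n):
        response = ask_model(augment(prompt))
        if is_jailbroken(response):
            return response
    return None
```

The point is how cheap this is: no gradients, no model access beyond the API, just brute-force sampling over harmless-looking input perturbations.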
At least they're finding vulnerabilities. Anthropic's podcast about LLMs lying to hide information is worrying, because you wouldn't know they're misleading you. Solid effort on their part to investigate early and share the results 💪
It's time for my usual two questions when someone posts something like this ;) 1. Is the instruction correct? 2. Does someone capable of carrying out that task according to the instructions without hurting themselves need the instruction generated by AI?
Stuart Winter-Tear, jailbreaking LLMs sounds like a wild ride, huh? It definitely opens some doors but also raises red flags. What do you think about the ethics of this?
The research paper can be found here: https://arxiv.org/pdf/2412.03556