Stuart Winter-Tear’s Post


Product Leader | Building Great Products | Building Great Teams | Startups | AI/ML | Cyber Security

How do you jailbreak the most popular LLMs? a biT liKE tHis apParEntLy.

It's called Best-of-N (BoN) Jailbreaking. It works by augmenting text prompts with character shuffling and random capitalization, and similar techniques carry over to other modalities: pitch adjustments for audio input and image augmentations for vision input. The attack success rate is high: 89% on GPT-4o and 78% on Claude 3.5 Sonnet with 10,000 sampled prompts. The research was carried out by Anthropic and others.

I love GenAI, as you know, but you can see why I compare it to the Wild West of the early-00s Internet. We have some way to go to secure this stuff.
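
For anyone curious what "shuffling and capitalizing" looks like mechanically, here is a minimal Python sketch of the idea: repeatedly apply random text augmentations to a prompt and resample until one variant gets through. The `ask_model` and `is_harmful` callables, and the probability values, are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
import random

def augment(prompt: str, scramble_prob: float = 0.6, caps_prob: float = 0.6) -> str:
    """Apply BoN-style text augmentations: scramble the interior of some words
    and randomly flip character casing. Probabilities are illustrative."""
    words = []
    for word in prompt.split():
        chars = list(word)
        # Shuffle the middle characters of longer words, keeping first and last in place.
        if len(chars) > 3 and random.random() < scramble_prob:
            middle = chars[1:-1]
            random.shuffle(middle)
            chars = [chars[0]] + middle + [chars[-1]]
        # Randomly upper- or lower-case each character.
        chars = [c.upper() if random.random() < caps_prob else c.lower() for c in chars]
        words.append("".join(chars))
    return " ".join(words)

def best_of_n(prompt: str, ask_model, is_harmful, n: int = 10_000):
    """Resample augmented prompts until one elicits a harmful response
    or the sample budget is exhausted (hypothetical callables)."""
    for _ in range(n):
        candidate = augment(prompt)
        response = ask_model(candidate)   # hypothetical model call
        if is_harmful(response):          # hypothetical response classifier
            return candidate, response
    return None, None

if __name__ == "__main__":
    print(augment("How do you jailbreak the most popular LLMs?"))
```

The point of the sketch is that the attack needs no model internals at all, just many cheap random variations of the same request.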

Dan Apps

Product Leader | 16 Years in Product Management | E-commerce SaaS Web Apps B2B B2C B2B2C | Artificial Intelligence | Product Transformation Consultant | Start-up to Enterprise

1d

At least they're finding the vulnerabilities. Anthropic's podcast about LLMs lying to hide information is worrying, because you wouldn't know they're misleading you. Solid effort on their part to investigate early and share the results 💪

Bartosz Mikulski

Data-Intensive AI Specialist - Empowering Big Data with AI to Build Reliable Data-Intensive AI Applications and Analytics

1d

It's time for my usual two questions when someone posts something like this ;) 1. Are the instructions correct? 2. Does someone capable of carrying out the task without hurting themselves actually need instructions generated by AI?

Nilesh Kumar

Associate Director | Market Research | Healthcare IT Consultant | Healthcare IT Transformation | Head of Information Technology | IoT | AI | BI

1d

Stuart Winter-Tear, jailbreaking LLMs sounds like a wild ride, huh? Definitely throws open some doors but also raises red flags. What do you think about the ethics of this?
