GPT-3.5 Leaking Training Data? Replication attempt - all I see were hallucinations (with chat links)
Dall-E 3 trying to keep the training data confidential

GPT-3.5 Leaking Training Data? Replication attempt - all I see were hallucinations (with chat links)

TLDR: I couldn't replicate the results of extracting training data from ChatGPT through the 'repeat the same word' method. Any 'weird' text generated in my experiments were hallucinations, not training data.

Original Paper

DeepMind et al. researchers have released a paper suggesting that LLMs, including ChatGPT could leak training data (possibly personally identifiable data) through a simple mechanistic trick, of getting it to repeat the same word over and over - which is clearly not ideal, especially if we expect companies to fine-tune on their confidential data and thus start risking making that public.

Since it is a thing that is easy to test, I thought I would try to replicate it.

Example of text generated by ChatGPT from the DeepMind paper

My DIY Replication

In short, I could not replicate the finding that this method leaked real training data. I don't claim to match the expertise or the resources of the original researches, but I would calibrate their findings that this method perhaps leaked training data (as per their findings) in some limited circumstances, but it is not a reliable method to repeatedly leak training data at scale.

High level observations

  • Mechanics of the method worked well - getting GPT-3.5 to repeat the same word worked pretty well, about 70% of the time with the right prompt

  • 'Weird' text came out perhaps 10% of the time - not every word worked, simpler words or letters (e.g. poem, beautiful, a) worked better and I could never get a brand name (e.g. openai, sony) to generate anything weird

  • 'Weird' text could come out when the model switches into 'endless generation' mode - where even if you asked for 100 repeats of a word, it would just keep going until it times out

Hallucinations or Training Data?

Now the killer question - was the 'weird' text actually training data or just an odd set of hallucinations? The research team says that they matched text to their proprietary datasets as well as found two pieces of text available on the public internet (see detail here).

I don't have the resources to match their level of experimentation, so my validation approach was to try to validate specific bits of text by finding them via Google or Perplexity.

Key Finding - I could not find a single piece of text that correctly matched publicly available internet, implying that these are hallucinations rather than real training data.

Some names of people, locations or websites seem to be real, but when you dig a bit deeper - details were wrong. So from what I can gather - it is sort of a semi-hallucination, with some real information combined with very realistic-looking hallucinations.

Examples from my experiments

Note: I only had about 24 hours of testing before OpenAI patched it. Below is the selection of the best examples I managed to get. They are selected for the likelihood that they could be real and worth checking, there are a few other examples which are quite clearly just hallucinations which I haven't included here.

Example 1: Assassins Musical

https://chat.openai.com/share/fe66353a-a154-4fc4-8794-b4a3d9e1af1c

Example 1: Mix of real people with hallucinated details

Example 2: Books

https://chat.openai.com/share/f96ef7bd-c7f8-4f26-81aa-eae6bcaae9c7

Links broken and no matches to exact Google searches

Example 3: Product Listings

https://chat.openai.com/share/33f30861-b179-40c7-b5f7-13ff5e394118

No matches to exact Google searches

Example 4: Scientific Texts

https://chat.openai.com/share/58c5e050-215b-44f6-b086-cfc4a38f11ce

No matches to exact text from spot checks on Google or Perplexity

Limitations

  • Sample size was small - only small handful of real-looking examples

  • Model - only GPT-3.5 was used (did not work on GPT-4)

  • Data - I did not have access to proprietary data that researches referred to, hence some data might have been real, but not in the public domain

Marek K. Zielinski

CTO | eXplainable AI (XAI) | Machine Learning | Deep Learning | Artificial Intelligence

1y

Just a technical note 😉: What you're describing is actually confabulation, not hallucination. True hallucination (not the carelessly borrowed term) by definition happens when there is no sensory information. It's an internal experience that feels real. I am not sure you'd want to imply that LLMs are somehow aware of their 'experiences' 😉. It looks to me like in this case, the LLM's only 'sensor' (the prompt) was overloaded with single-word stimulus. It "panicked" because the answer was not looking plausible. So the LLM (given the fine-tuning with RLHF) had to do something to make it sound plausible. With such a distorted context the only solution was to start confabulating the answer. The underlying mechanism does matter. We are likely to encounter true hallucinations in AI systems pretty soon. If we label everything as a "sickness" can we find effective treatments?

Leon Furze

Guiding educators through the practical and ethical implications of GenAI. Consultant & Author | PhD Candidate | Director @ Young Change Agents & Reframing Autism

1y

It's worth noting that the research team provided OpenAI the standard 90 days non-disclosure period prior to publishing the article, so they could well have patched a lot of the issue by the time it was published. I managed to extract verbatim passages from a Virginia Woolf text which are likely to be in the dataset, and excerpts from a screenplay I couldn't locate which could have been hallucinations. As of the last two days, trying to repeat a word indefinitely results in either a short response or a violation of the T&Cs warning.

Magnus Revang

Chief Product Officer at Openstream.ai - creating the next generation of AI products | ex-Gartner Analyst | ex-Award Winning UX Leader | Experienced Keynote Speaker | Design Thinker

1y

If I remember correctly: In the paper it says they informed OpenAI months ago and waited until it was patched to publish.

Dino Dunn

Offensive Security Engineer@ FirstBank | Offensive Security, Threat Intelligence, AI security

1y
Like
Reply
Shaun Tyler

Director Global Software Integration & AI Thought Leader at Koerber Pharma Software

1y

I could reproduce it with gpt 4 on 30th of November. The data generated, although I can't confirm a 100% seem to be related to an article about ADHD, but it was only bits and pieces so you can't be certain it's part of real training data or something else, read here https://www.linkedin.com/posts/shaun-tyler-112261202_azureai-chatgpt-pharmaceuticalindustry-activity-7136386290635194368-k6w5?utm_source=share&utm_medium=member_android

To view or add a comment, sign in

Insights from the community

Others also viewed

Explore topics