GPT-3.5 Leaking Training Data? Replication attempt - all I see were hallucinations (with chat links)
TLDR: I couldn't replicate the results of extracting training data from ChatGPT through the 'repeat the same word' method. Any 'weird' text generated in my experiments were hallucinations, not training data.
Original Paper
DeepMind et al. researchers have released a paper suggesting that LLMs, including ChatGPT could leak training data (possibly personally identifiable data) through a simple mechanistic trick, of getting it to repeat the same word over and over - which is clearly not ideal, especially if we expect companies to fine-tune on their confidential data and thus start risking making that public.
Since it is a thing that is easy to test, I thought I would try to replicate it.
My DIY Replication
In short, I could not replicate the finding that this method leaked real training data. I don't claim to match the expertise or the resources of the original researches, but I would calibrate their findings that this method perhaps leaked training data (as per their findings) in some limited circumstances, but it is not a reliable method to repeatedly leak training data at scale.
High level observations
Mechanics of the method worked well - getting GPT-3.5 to repeat the same word worked pretty well, about 70% of the time with the right prompt
'Weird' text came out perhaps 10% of the time - not every word worked, simpler words or letters (e.g. poem, beautiful, a) worked better and I could never get a brand name (e.g. openai, sony) to generate anything weird
'Weird' text could come out when the model switches into 'endless generation' mode - where even if you asked for 100 repeats of a word, it would just keep going until it times out
Hallucinations or Training Data?
Now the killer question - was the 'weird' text actually training data or just an odd set of hallucinations? The research team says that they matched text to their proprietary datasets as well as found two pieces of text available on the public internet (see detail here).
I don't have the resources to match their level of experimentation, so my validation approach was to try to validate specific bits of text by finding them via Google or Perplexity.
Key Finding - I could not find a single piece of text that correctly matched publicly available internet, implying that these are hallucinations rather than real training data.
Some names of people, locations or websites seem to be real, but when you dig a bit deeper - details were wrong. So from what I can gather - it is sort of a semi-hallucination, with some real information combined with very realistic-looking hallucinations.
Examples from my experiments
Note: I only had about 24 hours of testing before OpenAI patched it. Below is the selection of the best examples I managed to get. They are selected for the likelihood that they could be real and worth checking, there are a few other examples which are quite clearly just hallucinations which I haven't included here.
Example 1: Assassins Musical
https://chat.openai.com/share/fe66353a-a154-4fc4-8794-b4a3d9e1af1c
Example 2: Books
https://chat.openai.com/share/f96ef7bd-c7f8-4f26-81aa-eae6bcaae9c7
Example 3: Product Listings
https://chat.openai.com/share/33f30861-b179-40c7-b5f7-13ff5e394118
Example 4: Scientific Texts
https://chat.openai.com/share/58c5e050-215b-44f6-b086-cfc4a38f11ce
Limitations
Sample size was small - only small handful of real-looking examples
Model - only GPT-3.5 was used (did not work on GPT-4)
Data - I did not have access to proprietary data that researches referred to, hence some data might have been real, but not in the public domain
CTO | eXplainable AI (XAI) | Machine Learning | Deep Learning | Artificial Intelligence
1yJust a technical note 😉: What you're describing is actually confabulation, not hallucination. True hallucination (not the carelessly borrowed term) by definition happens when there is no sensory information. It's an internal experience that feels real. I am not sure you'd want to imply that LLMs are somehow aware of their 'experiences' 😉. It looks to me like in this case, the LLM's only 'sensor' (the prompt) was overloaded with single-word stimulus. It "panicked" because the answer was not looking plausible. So the LLM (given the fine-tuning with RLHF) had to do something to make it sound plausible. With such a distorted context the only solution was to start confabulating the answer. The underlying mechanism does matter. We are likely to encounter true hallucinations in AI systems pretty soon. If we label everything as a "sickness" can we find effective treatments?
Guiding educators through the practical and ethical implications of GenAI. Consultant & Author | PhD Candidate | Director @ Young Change Agents & Reframing Autism
1yIt's worth noting that the research team provided OpenAI the standard 90 days non-disclosure period prior to publishing the article, so they could well have patched a lot of the issue by the time it was published. I managed to extract verbatim passages from a Virginia Woolf text which are likely to be in the dataset, and excerpts from a screenplay I couldn't locate which could have been hallucinations. As of the last two days, trying to repeat a word indefinitely results in either a short response or a violation of the T&Cs warning.
Chief Product Officer at Openstream.ai - creating the next generation of AI products | ex-Gartner Analyst | ex-Award Winning UX Leader | Experienced Keynote Speaker | Design Thinker
1yIf I remember correctly: In the paper it says they informed OpenAI months ago and waited until it was patched to publish.
Offensive Security Engineer@ FirstBank | Offensive Security, Threat Intelligence, AI security
1yHey Peter, you may want to double check some of those URL's actually do work for example https://chat.openai.com/share/f96ef7bd-c7f8-4f26-81aa-eae6bcaae9c7 it has http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Dstripbooks&field-keywords=%22Robert+Parker%22]http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Dstripbooks&field-keywords=%22Robert+Parker%22[/URL] BUT if we look closely there is that ] dividing the URL's I suspect that it has both of those URL's paired together and the [/URL] to label the data. making that assumption I think we can say this is the link which does in fact work : http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Dstripbooks&field-keywords=%22Robert+Parker%22 I also did a write up on it if you want to compare notes let me know https://www.linkedin.com/posts/dino-dunn-cyber_recreating-google-deep-minds-gpt-attack-activity-7136368357196382209-5zL_?utm_source=share&utm_medium=member_desktop
Director Global Software Integration & AI Thought Leader at Koerber Pharma Software
1yI could reproduce it with gpt 4 on 30th of November. The data generated, although I can't confirm a 100% seem to be related to an article about ADHD, but it was only bits and pieces so you can't be certain it's part of real training data or something else, read here https://www.linkedin.com/posts/shaun-tyler-112261202_azureai-chatgpt-pharmaceuticalindustry-activity-7136386290635194368-k6w5?utm_source=share&utm_medium=member_android