The age of generative AI is here: only six months after OpenAI‘s ChatGPT burst onto the scene, as many as half the employees of some leading global companies are already using this type of technology in their workflows, and many other companies are rushing to offer new products with generative AI built in.
But, as those following the burgeoning industry and its underlying research know, the data used to train the large language models (LLMs) and other transformer models underpinning products such as ChatGPT, Stable Diffusion and Midjourney comes initially from human sources — books, articles, photographs and so on — that were created without the help of artificial intelligence.
Now, as more people use AI to produce and publish content, an obvious question arises: What happens as AI-generated content proliferates around the internet, and AI models begin to train on it, instead of on primarily human-generated content?
A group of researchers from the UK and Canada have looked into this very problem and recently published a paper on their work on arXiv, the open-access preprint server. What they found is worrisome for current generative AI technology and its future: “We find that use of model-generated content in training causes irreversible defects in the resulting models.”
‘Filling the internet with blah’
Specifically looking at probability distributions for text-to-text and image-to-image AI generative models, the researchers concluded that “learning from data produced by other models causes model collapse — a degenerative process whereby, over time, models forget the true underlying data distribution … this process is inevitable, even for cases with almost ideal conditions for long-term learning.”
“Over time, mistakes in generated data compound and ultimately force models that learn from generated data to misperceive reality even further,” wrote one of the paper’s leading authors, Ilia Shumailov, in an email to VentureBeat. “We were surprised to observe how quickly model collapse happens: Models can rapidly forget most of the original data from which they initially learned.”
In other words: the more an AI model is trained on AI-generated data, the worse it performs over time, producing more errors in the responses and content it generates, and far less variety in the correct responses it does produce.
As another of the paper’s authors, Ross Anderson, professor of security engineering at Cambridge University and the University of Edinburgh, wrote in a blog post discussing the paper: “Just as we’ve strewn the oceans with plastic trash and filled the atmosphere with carbon dioxide, so we’re about to fill the Internet with blah. This will make it harder to train newer models by scraping the web, giving an advantage to firms which already did that, or which control access to human interfaces at scale. Indeed, we already see AI startups hammering the Internet Archive for training data.”
Ted Chiang, acclaimed sci-fi author of “Story of Your Life,” the novella that inspired the movie Arrival, and a writer at Microsoft, recently published a piece in The New Yorker postulating that AI copies of copies would result in degrading quality, likening the problem to the increased artifacts visible as one copies a JPEG image repeatedly.
Another way to think of the problem is the 1996 sci-fi comedy Multiplicity, starring Michael Keaton, in which a man clones himself and then clones the clones, each copy coming out noticeably less intelligent than the last.
How ‘model collapse’ happens
In essence, model collapse occurs when the data AI models generate ends up contaminating the training set for subsequent models.
“Original data generated by humans represents the world more fairly, i.e. it contains improbable data too,” Shumailov explained. “Generative models, on the other hand, tend to overfit for popular data and often misunderstand/misrepresent less popular data.”
Shumailov illustrated this problem for VentureBeat with a hypothetical scenario, wherein a machine learning model is trained on a dataset with pictures of 100 cats — 10 of them with blue fur, and 90 with yellow. The model learns that yellow cats are more prevalent, but also represents blue cats as more yellowish than they really are, returning some green-cat results when asked to produce new data. Over time, the original trait of blue fur erodes through successive training cycles, turning from blue to greenish, and ultimately yellow. This progressive distortion and eventual loss of minority data characteristics is model collapse. To prevent this, it’s important to ensure fair representation of minority groups in datasets, in terms of both quantity and accurate portrayal of distinctive features. The task is challenging due to models’ difficulty learning from rare events.
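To make the dynamic concrete, here is a minimal, purely illustrative sketch in Python (not the researchers’ code, and not an actual generative model): each “generation” refits a simple distribution over cat colors to samples drawn from the previous generation, with a small sharpening step standing in for a model’s tendency to overfit to popular data. Run as-is, the minority “blue” share dwindles toward zero within a handful of generations.

```python
import random

# Toy illustration of the erosion of minority data over generations.
# A "model" here is just a category distribution over cat colors; each
# generation is refit to a finite sample drawn from the previous one.

def simulate(original_fraction: float, generations: int = 10, n_samples: int = 200) -> None:
    random.seed(0)
    true_dist = {"blue": 0.10, "yellow": 0.90}  # the original, human-curated data
    dist = dict(true_dist)

    for gen in range(1, generations + 1):
        # Optionally mix some untouched original data into the training set.
        n_orig = int(n_samples * original_fraction)
        draws = random.choices(list(true_dist), weights=list(true_dist.values()), k=n_orig)
        draws += random.choices(list(dist), weights=list(dist.values()), k=n_samples - n_orig)

        # Refit to the finite sample: rare categories that go unsampled vanish.
        counts = {c: draws.count(c) for c in dist}
        dist = {c: counts[c] / n_samples for c in dist}

        # Mimic overfitting to popular data by sharpening toward the mode
        # (roughly analogous to sampling at a low temperature).
        sharpened = {c: p ** 1.2 for c, p in dist.items()}
        total = sum(sharpened.values())
        dist = {c: p / total for c, p in sharpened.items()}

        print(f"generation {gen:2d}: blue={dist['blue']:.3f}, yellow={dist['yellow']:.3f}")

# Train each generation only on the previous generation's output.
simulate(original_fraction=0.0)
```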
This “pollution” with AI-generated data results in models gaining a distorted perception of reality. Even when researchers trained the models not to produce too many repeating responses, they found model collapse still occurred, as the models would start to make up erroneous responses to avoid repeating data too frequently.
“There are many other aspects that will lead to more serious implications, such as discrimination based on gender, ethnicity or other sensitive attributes,” Shumailov said, especially if generative AI learns over time to produce, say, one race in its responses, while “forgetting” others exist.
It’s important to note that this phenomenon is distinct from “catastrophic forgetting,” where models lose previously learned information. In contrast, model collapse involves models misinterpreting reality based on their reinforced beliefs.
The researchers behind this paper found that even if 10% of the original human-authored data is used to train the model in subsequent generations, “model collapse still happens, just not as quickly,” Shumailov told VentureBeat.
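The toy sketch above loosely mirrors this: calling the same simulate function with a 10% slice of original data mixed into every generation’s training set, the minority trait erodes more slowly and keeps being partially replenished, though it stays well below its true share.

```python
# Same illustrative sketch as above, but 10% of each generation's training
# set is drawn from the untouched original data rather than the prior model.
simulate(original_fraction=0.10)
```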
Ways to avoid ‘model collapse’
Fortunately, there are ways to avoid model collapse, even with existing transformers and LLMs.
The researchers highlight two specific ways. The first is to retain a pristine copy of the original, exclusively or nominally human-produced dataset, and avoid contaminating it with AI-generated data. The model could then be periodically retrained on this data, or refreshed entirely with it, starting from scratch.
The second way to avoid degradation in response quality and reduce unwanted errors or repetitions from AI models is to introduce new, clean, human-generated datasets back into their training.
However, as the researchers point out, this would require some sort of mass labeling mechanism or effort by content producers or AI companies to differentiate between AI-generated and human-generated content. At present, no such reliable or large-scale effort exists online.
“To stop model collapse, we need to make sure that minority groups from the original data get represented fairly in the subsequent datasets,” Shumailov told VentureBeat, continuing:
“In practice it is completely non-trivial. Data needs to be backed up carefully, and cover all possible corner cases. In evaluating performance of the models, use the data the model is expected to work on, even the most improbable data cases. Note that this does not mean that improbable data should be oversampled, but rather that it should be appropriately represented. As progress drives you to retrain your models, make sure to include old data as well as new. This will push up the cost of training, yet will help you to counteract model collapse, at least to some degree.”
What the AI industry and users can do about it going forward
While all this news is worrisome for current generative AI technology and the companies seeking to monetize it, especially in the medium-to-long term, there is a silver lining for human content creators: The researchers conclude that in a future filled with gen AI tools and their content, human-created content will be even more valuable than it is today, if only as a source of pristine training data for AI.
These findings have significant implications for the field of artificial intelligence, emphasizing the need for improved methodologies to maintain the integrity of generative models over time. They underscore the risks of unchecked generative processes and may guide future research to develop strategies to prevent or manage model collapse.
“It is clear, though, that model collapse is an issue for ML and something has to be done about it to ensure generative AI continues to improve,” Shumailov said.