One of the buzziest research papers in the world of artificial intelligence emerged from an idle lunchtime conversation between Ilia Shumailov and his brother Zakhar, as they wondered whether it would be easier or harder to train large language models in the future.
The internet is awash in the text data used to train LLMs, which underlie chatbots and other AI applications. The output of these models, such as text from ChatGPT, could also be used to train future AI models. But it might not be entirely suitable. “We were basically sitting and chatting, and on a piece of paper, trying to map out what the proof would look like,” recalled Dr. Shumailov, a former research fellow at the Vector Institute in Toronto and a junior research fellow at the University of Oxford.
The two brothers turned the question into a formal study, along with University of Toronto assistant professor Nicolas Papernot and others. They reached a startling conclusion: Training AI models repeatedly on AI-generated data renders them useless. Text models spout gibberish, and image models barf garbage. They dubbed the phenomenon “model collapse.”
On the surface, the findings are alarming. Generative AI models need massive amounts of data to find patterns, build associations and output coherent results. Today’s LLMs have already been trained on wide swaths of internet content and need fresh data to improve. The conclusion that AI-generated data will pollute future models, just as lead coursing through the bloodstream turns the human mind and body to mush, is worrisome, to say the least.
Abeba Birhane, a senior fellow in trustworthy AI at the Mozilla Foundation, wrote on X that model collapse is the “Achilles’ heel that’ll bring the gen AI industry down.” Ed Zitron, who pens a popular Substack often expounding on the shortcomings of generative AI, wrote, “It’s tough to express how deeply dangerous this is for AI.” Gary Marcus, another generative AI critic, wrote on X, “So hard to tell whether AI systems are sucking on each other’s fumes, in a way that could ultimately lead to disaster,” accompanied by a sarcastic eye-rolling emoji.
But Dr. Shumailov isn’t quite so pessimistic. An early version of the paper was released last year, and the updated version was published in Nature at the end of July. In the interim, other researchers have looked not only at ways to prevent model collapse, but also at how to use AI-generated data to improve performance.
“I’m sure progress will continue. I don’t know at what scale,” Dr. Shumailov said. “I don’t think there is an answer to this as of today.”
What model collapse could portend, however, is more complexity and cost when it comes to building LLMs, which is unwelcome news given that generative AI is already expensive and the financial returns uncertain. The debate around model collapse also shows that at a time when generative AI is massively hyped and some companies are spending big in hopes of seeing huge productivity gains, there is still a heck of a lot we don’t know about how this stuff works.
Dr. Shumailov and his co-authors contend that model collapse is a problem as more and more AI-generated content finds its way online. Because AI companies routinely scrape the internet for data, synthetic text and other media will invariably get swept up into the digital maw to feed new models, if it hasn’t already.
To see what would happen in this scenario, Dr. Shumailov and his colleagues fine-tuned an LLM on its own text outputs, over and over again. After a few cycles, the model was vomiting nonsense: “architecture. In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-.”
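The shape of that experiment is simple to sketch. The toy below is not the paper’s setup, which fine-tuned a real LLM; it is a stand-in that shows the same loop on a word-level bigram model trained again and again on its own samples. The corpus.txt file is a placeholder for any human-written text.

```python
import random
from collections import defaultdict, Counter

def train_bigram(words):
    # Fit a word-level bigram model: counts of each next word given the current word.
    model = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        model[a][b] += 1
    return model

def generate(model, length, seed_word):
    # Sample text from the fitted counts, one word at a time.
    out = [seed_word]
    for _ in range(length - 1):
        followers = model.get(out[-1])
        if not followers:
            break
        choices, weights = zip(*followers.items())
        out.append(random.choices(choices, weights=weights)[0])
    return out

random.seed(0)
words = open("corpus.txt").read().split()  # placeholder: any human-written text

for generation in range(10):
    model = train_bigram(words)
    words = generate(model, length=len(words), seed_word=words[0])
    # Vocabulary tends to shrink generation after generation as rare words drop out.
    print(f"generation {generation}: {len(set(words))} distinct words")
```

Each generation learns only from the previous generation’s output, so the text gradually becomes more repetitive, a crude analogue of the nonsense the real experiment produced.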
In another scenario, they tried a mix of authentic and AI-generated data, a far likelier possibility in the real world. That led to milder degradation, but degradation is still a problem: LLMs performing at that level probably wouldn’t even be released. “They will not pass quality controls,” Dr. Shumailov said.
The paper describes the technical reasons for model collapse, though the simplest explanation is that mistakes are encoded and compound over time. Metaphors are helpful here, too, such as a snake eating its own tail, or inbreeding.
Information that is not well represented in the original data set gets lost, too. Another study that Dr. Shumailov contributed to, released earlier this year with University of Toronto researchers, found that bias can be amplified over just a few training cycles.
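The same dynamic shows up in a setting much simpler than language. The simulation below is not from the paper’s code; it is a minimal illustration of how estimation error compounds when a distribution is repeatedly fitted to samples drawn from the previous fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: a small sample from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=100)

for generation in range(1, 1001):
    # Fit a Gaussian to the previous generation's samples...
    mu, sigma = data.mean(), data.std()
    # ...then draw the next generation's training data from that fitted model.
    data = rng.normal(loc=mu, scale=sigma, size=100)
    if generation % 200 == 0:
        print(f"generation {generation}: fitted std = {sigma:.4f}")

# Because each fit is made from a finite sample, small errors accumulate and
# the spread tends to shrink toward zero: the extreme values, the poorly
# represented information, vanish first.
```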
As a result of these issues, the rate of progress in AI development could slow down, as companies will no longer be able to indiscriminately scrape data from the web. “The advancements that we’ve seen thus far, maybe they’re going to be slowing down a little bit, unless we find another way to discover knowledge,” Dr. Shumailov said.
For the past few years, the overriding principle in AI development has been scale: compiling lots of data to build bigger and better models, fuelled by lots of computing power. That principle breaks down if AI-generated content proliferates, said Julia Kempe, a computer science professor at New York University who has studied the issue. “With scaling laws, when we double the amount of data, error rates should go down,” she said. “But if the data is generated by some other model and you want to scale that model up, it just won’t work.”
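Those scaling laws are usually expressed as a power law: predicted error falls as the amount of clean training data grows. The constants below are invented purely for illustration; real values differ from model to model.

```python
# Illustrative power-law scaling: error ≈ A * N^(-alpha).
# A and alpha are made-up numbers, not measurements from any real model.
A, alpha = 5.0, 0.1

def predicted_error(n_tokens: float) -> float:
    return A * n_tokens ** -alpha

for n in (1e9, 2e9, 4e9):
    print(f"{n:.0e} tokens -> predicted error {predicted_error(n):.3f}")

# Each doubling of clean data lowers the predicted error by a constant factor
# (2 ** -alpha). The argument here is that this relationship stops holding
# once the added data is itself generated by a model.
```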
So does this mean the end of generative AI? Far from it. “This is not as catastrophic as some people are happy to say it is,” said Quentin Bertrand, who until recently was a postdoctoral researcher at the Mila AI institute in Montreal.
In a paper first released last September, he and his colleagues replicated model collapse, and their study contains images of a man that were produced by a model trained on its own outputs. After 20 cycles, the man’s face fuses with the background and appears to sport a beard of white mould.
But more importantly, the study found that if the quality of synthetic data within a training set is good enough, and the proportion of original content is large enough, then model collapse will be avoided. Dr. Kempe and her colleagues came to a similar conclusion in a separate study this year.
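A rough way to see that effect is to extend the Gaussian toy from earlier: keep a fixed share of the original data in every generation’s training set instead of training on synthetic samples alone. The 50-per-cent split below is an arbitrary choice for illustration, not a threshold from either study.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=100)   # a fixed pool of "human" data
real_share = 0.5                        # arbitrary illustrative proportion

data = real.copy()
for generation in range(1, 1001):
    mu, sigma = data.mean(), data.std()
    synthetic = rng.normal(mu, sigma, size=len(real))
    # Blend the original data with fresh synthetic samples each generation.
    n_real = int(real_share * len(real))
    data = np.concatenate([real[:n_real], synthetic[n_real:]])
    if generation % 200 == 0:
        print(f"generation {generation}: std = {data.std():.4f}")

# Anchored by the real data, the spread stays close to 1 rather than
# drifting toward zero as it did when every generation was fully synthetic.
```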
Mr. Bertrand is also skeptical that the amount of AI-generated data on the internet is enough to corrupt future models. “From our experiments, you need a significant amount of synthetic data to observe degradation,” he said. “The amount of synthetic data online is still very small.” The authors of the model collapse paper, he noted, used a lot of AI-generated data when running their experiments.
What’s more, AI-generated content found online can be quite good. When people post images online that they made with AI applications such as Midjourney, for example, they’re likely publishing the best results. “What you’re putting online might not be garbage,” he said.
Synthetic data have also been shown to improve AI models in certain settings. DeepMind’s AlphaGo Zero learned to master the game of Go by competing against itself, without any data from sessions played by us mortals. Meta Platforms Inc. also used synthetic data to improve the coding abilities of its latest LLM.
“The problem with this approach is that it is hard to get it right,” said Pablo Villalobos, a senior researcher with Epoch AI, which studies trends in machine learning. Mathematics, games and computer coding are domains with clearly defined rules and correct answers. That’s not the case with a whole range of tasks that companies are hoping to use LLMs for, such as dealing with customer service inquiries or designing websites, where quality is more subjective and harder to assess. “It’s going to take a lot of work, and trial and error,” Mr. Villalobos said.
AI developers may have to devote more resources to filtering out the kind of synthetic junk that could harm models while prioritizing the data that will help, all of which could add cost. AI companies already rely on legions of typically underpaid workers to annotate, rank and label data for models to learn from, and that process will only become more important. “We will continue to need humans to be arbiters of our data,” Dr. Kempe said. “We should appreciate our data labellers and pay them well.”
The potential for model collapse emphasizes the value of original, human-created data, too. Companies such as OpenAI are already striking financial deals with news publishers and other content producers for access to material.
Dr. Shumailov expects large AI developers to be able to weather these burdens, given their ample resources. “Small players are going to suffer a lot more,” he said. Generative AI is already the domain of big, well-funded companies, and that’s unlikely to change.
As for the debate about the scale and implications of model collapse, he welcomes the discussion. “Before we released that work,” he said, “nobody was even talking about this.”