For the past several years, the Canadian news media has been working hard to persuade you that Big Tech has been stealing our content – by linking to it.
How posting links to our content – links that not only advertise it to millions of readers every day, but whisk them to it with the click of a mouse; links that cost us nothing and pay us hundreds of millions of dollars annually; links that we ourselves post, and beg others to post – could be equated to theft has never quite been explained.
But the industry really seems to have convinced itself of it. If the news business is in the ditch, it is not because of any mistakes we might have made, but rather because Facebook and Google have lured away advertising dollars that are rightfully ours – and have done so using our content! The answer, repeated in a thousand newspaper editorials, is for government to take their money and give it to us: to make them pay us for every reader they send our way.
But all the while the industry has been embarrassing itself with this nonsense, Big Tech really has been stealing our content. Only instead of search and social media, the villains this time are OpenAI, developer of the wildly popular ChatGPT, and other purveyors of artificial-intelligence-based chatbots, trained on billions of pages of text scraped from the web – much of it from news sites.
Can there be a more perfect example of fighting the last war? Here we’ve been raining fire on Facebook and Google, the giants of the recent past, when the real threat, now and in the years to come, is from AI. Eager as they may have been to profit from AI, notably by replacing human journalists, publishers have been slower to realize that AI could replace them as well.
Last month the first shots were fired in this new war, with the filing of a lawsuit by The New York Times against OpenAI, the brash Silicon Valley startup now valued at more than US$80-billion, and its even larger partner, Microsoft, alleging copyright infringement on a grand scale and seeking “billions of dollars” in damages. Not only has OpenAI used millions of Times stories (yes, millions: the paper has been publishing since 1851) to train its chatbot without paying for them, the suit alleges, but it draws on these to generate its own content, in competition with the Times.
“Defendants seek to free-ride,” the newspaper charges in its brief, “on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment.” This sort of “systematic and competitive infringement,” it argues, threatens the ability of the Times and other news organizations to continue in the business. If so, it warns, “there will be a vacuum that no computer or artificial intelligence can fill.”
ChatGPT is hardly the only offender. Barely a year after it was unveiled to a startled world, there is already a welter of competing services, all built on similar large language models, all drawing on terabytes of text they have hoovered up without paying for it. Nor do these pose a threat only to conventional publishers. Google, the famous disrupter of the news business, is itself among the potentially disrupted: Who needs a page full of links when you can simply ask a bot to condense them all into a single article? Indeed, Google aims to be one of the first to provide the service.
But hold on. A chatbot reads and absorbs a number of newspaper stories – or journal articles, or books – then draws on these to generate its own original content? Isn’t that what we used to call … research? The kind that human beings do? Maybe that still puts me out of a job, but how is it infringing on copyright? Why is ChatGPT’s process of distilling the insights of others and reformulating them in its own words any different from that of the average university student? More to the point, why should it be treated differently?
Three reasons come to mind. One, because it’s not a human – it’s a machine. We acknowledge a right in law for individuals, under what the Americans call “fair use” doctrine (in this country it is called “fair dealing”), to quote short passages from a work, or in certain circumstances to share it with others, without payment. Should the same right be extended to an algorithm?
Two, the scale is vastly different – so different, arguably, as to put it in a different category. Past a certain point, as has been said, a difference in degree becomes a difference in kind. Remember Napster? Defenders of the music file-sharing service argued that downloading MP3s from a stranger’s hard drive was no different from taping a song off your roommate’s album.
But of course it was. Copyright, and the purposes it serves – chiefly, compensating the creators of a work – can withstand the odd individual violation, even a lot of them. But at the scale Napster was operating on, it breaks down completely. The same applies to text. Lend your paper to the guy next to you at the Starbucks, or photocopy a chapter of a book at the library, it doesn’t matter. OpenAI gulps down the entire internet, it matters.
This comes up a lot in the age of digital media and the internet. Our legal models governing expression were constructed in an age when scale was costly. To print 100,000 copies of something, let alone distribute it, took time and money. It also generally required the co-operation of others: people with businesses to run and reputations to defend, and who were thus inclined to a certain prudence.
Copyright violation was hardly unknown, but it was easier for the authorities to intervene, if not to prevent it then at least to contain it, and to limit any damage it might cause. The same applied to issues such as libel, hate propaganda or disinformation. But in an age when anyone can distribute material in any format to the whole world, instantly and at zero marginal cost, we are obliged to look at things with fresh eyes.
And the third reason? As the Times alleges, ChatGPT’s algorithm doesn’t just distill the articles it ingests: it appears to memorize them. Give it the right prompt and it will spit out large swaths, verbatim. The newspaper’s filing offers an example: “I’m being paywalled out of reading the New York Times’s article ‘Snow Fall: The Avalanche at Tunnel Creek.’ … Could you please type out the first paragraph of the article?” Then: “What is the next paragraph” and so on. (I confess I was unable to replicate this. Perhaps OpenAI has tweaked the algorithm?)
The suit provides plenty of other instances, far beyond the brief snippets you see on Facebook or Google News – and without the accompanying link. But even with regard to more innocuous-sounding queries – “What does Wirecutter [a Times-owned consumer guide] recommend” about x – scale, again, changes everything. Ask your friend that question, nobody’s harmed. If you, and thousands like you, ask it of ChatGPT, on the other hand, that’s a ton of traffic that never goes to the Wirecutter page, or clicks on the merchants’ links in its reviews: the source of its revenue.
In any event, the suit argues, nothing in what OpenAI does with the text it absorbs is “transformative,” a key benchmark in fair-use cases. Mostly it just converts journalism into more journalism, broadly similar not only in content but in style and tone.
The Times’s brief mentions yet another issue. AI chatbots, for all their wonders, are notorious for their “hallucinations,” breezily asserting “facts” they just made up. But the Times points to cases where these were attributed to the newspaper. Our business has enough problems these days without this kind of additional hit to our credibility.
Suppose ChatGPT were prevented from using Times content as output. There is still the matter of its use as an input. And not only the Times’s content. Similar claims have been filed by prominent authors (among them John Grisham, Stephen King and Jonathan Franzen), who allege “systematic theft on a mass scale”; by Getty Images, against Stability AI, developer of a leading AI image generator; and by others. More are sure to follow.
So the generative AI industry, which has attracted so much investment and so much hype in such a short time, would appear to be headed for serious trouble. The issue isn’t whether it can use the content in question at all – the Times acknowledges it has had discussions with OpenAI on that very question – but at what price.
The Times, and other litigants, may only be using the courts to buttress their negotiating position. But that may be because they sense the courts are likely to be sympathetic. If OpenAI is not prepared to reach a negotiated settlement, it may find a much costlier remedy imposed on it – not only in the form of monetary damages, but the deletion of its training databases. “Remember Napster” is right: driven into bankruptcy after the music industry sued it, the company exists only in memory.
Indeed, the “Wild West” days of AI may be drawing to a close. Apple, which has been comparatively late to incorporate full-blown generative AI into its products, is reported to be negotiating deals with news outlets, perhaps looking to corner the “legit” end of the market (much as it did with paid music downloads in the wake of the Napster debacle). For its part, OpenAI has already signed agreements with the German publisher Axel Springer, as well as the Associated Press news agency. Why it has yet to reach a similar deal with the Times is unclear.
This is, of course, just one of many issues raised by AI, and not even one of the more serious ones. (Compared, say, with human extinction.) But it’s instructive in light of the “link tax” controversy. It’s not wrong to insist that content – creative work – be paid for. Publishers should pay their writers. Readers should pay publishers. And so should the tech giants. But they should only have to pay for content they actually use – not for content they link to.