The news industry’s problem has also been my problem. For the past seven years, I ran a team at Google focused on making the web ecosystem more hospitable to news publishers. We built products to make the production of expensive journalism cheaper (giving them cutting-edge AI document analysis and transcription tools), to make it easier for people to buy subscriptions, and to let publishers showcase their editorial viewpoints and thus find their audiences more effectively. In aggregate, these things delivered billions of dollars of value to publishers around the world.
But they did not fundamentally alter the fact that the internet had hollowed out the value of the daily newspaper. Back in the day, if you wanted to know a sports score, a stock quote, a movie showtime, where the garage sales were or what concerts were coming up, you looked in the newspaper. Now, the web allows you to find this information more quickly elsewhere. So, if consumers once had 20 reasons to buy a newspaper, now they have only one: news — the labor-intensive, expensive work of reporting and writing it — which isn’t a thing advertisers are especially excited to be associated with.
To combat this turn of events, news publishers, first in Europe but increasingly around the world, began turning to regulators and legislators to restore their past dominance — or at least their profitability. And I had to figure out how Google would respond to these demands.
The publishers’ complaints were premised on the idea that web platforms such as Google and Facebook were stealing from them by posting — or even allowing publishers to post — headlines and blurbs linking to their stories. This was always a silly complaint because of a universal truism of the internet: Everybody wants traffic! Just look at the time and money publishers spend putting their links and content on those platforms — paying search-engine optimization companies and social media managers to get more links higher on the page. We found ourselves in the disorienting situation in which one team from a publisher charged, “You are stealing from us by placing our results on your site,” while another team complained, “It’s critically important to us that you place our results on your site more often and more prominently!”
This is not to say that news publishers had no legitimate complaints: Until 2017, Google would rarely link to stories behind a paywall, which was crippling to the subscription model that web publishers were coming to rely on. The selection of news results was imperfect, sometimes placing a site that had done painstaking original reporting below a less authoritative site that had done a quick rewrite of that scoop; and many readers were only interested in scanning the headlines and didn’t click to read the actual story. Google fixed the first of these, made steady progress against the second and is powerless to solve the third — a battle that cover designers and front-page editors had been fighting for decades before the web.
In any event, regulators pursued the illegitimate complaint: the idea that platforms should pay publishers every time they display a headline and blurb, or sometimes even for the act of linking itself. As these regulations or threats of regulation spread around the world — Europe, Australia, Indonesia, Brazil, Canada — I spent more and more time preparing to disable news products or search, or building accounting systems to count “snippets” and calculate payments. That meant I spent less time giving journalists research and transcription tools or building mechanisms to help retain subscribers.
As for Facebook, its traffic to news publishers plummeted year after year. It is a well-known economic fact that when you take something with an established market price and impose a fixed price above it, demand goes down. Prior to these laws, no one ever asked permission to link to a website or paid to do so. Quite the contrary: if anyone got paid, it was the party doing the linking. Why? Because everybody wants traffic! After all, this is why advertising businesses — publishers and platforms alike — can exist in the first place. They offer distribution to advertisers, and the advertisers pay them because distribution is valuable and seldom free.
While this sideshow was going on, we would hear how much closer large language models (LLMs) had gotten to reproducing human-level composition. Then LLM-based features began to show up in multiple products — grammar checking, autocomplete, etc. — and actually worked. To me, watching publishers bicker about payment for search results while LLMs advanced at a frenetic pace was like watching people squabble about the floral arrangements at an outdoor wedding while the largest storm cloud you can imagine moves silently closer.
And then, like a thunderclap, ChatGPT launched and put everything in stark relief. The problem has never been that platforms post links to news articles — that’s what they should do. The problem is that new technology has created a landscape where they might not need to link to news sites at all — they can just take the news, have a robot rewrite it and publish it in their own products.
And, for me, the world turned suddenly upside down. The absurd demand of news publishers — “send me traffic and then pay me for having done so!” — would soon be eclipsed by an equally absurd proposition from the tech industry: “How about we build a product on your content and send you little or no traffic in return?” In the long run, neither of these irrationalities can stand. They’ll either wither away because of their own economic absurdity or end up in the crosshairs of courts, legislators or regulators.
But having seen firsthand the feckless way in which regulators lined up behind the first of those propositions, I’m bracing myself for how they’ll handle the second. The stakes couldn’t be higher. On one side of the conflict sits existential risk for the publishing industry; on the other, existential risk for technological innovation.
First come the courts. The New York Times fired the opening salvo in December in a suit charging OpenAI and Microsoft with violation of its copyright, starting with the use of its documents in training OpenAI’s LLMs.
It seems quite plausible that the tech companies will win this first round. In training, LLMs transform text into geometric relationships that are fundamentally different from the news stories they came from, and those mathematical “vectors” cannot be substituted for the original stories. In other words, training an LLM seems to pass the tests for fair use.
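To make that distinction concrete, here is a toy sketch of what it means to reduce text to a vector. The hashing scheme below is purely illustrative, my own stand-in rather than anything resembling a production model, but it captures the one-way nature of the transformation: the numbers encode relationships among words, and no reader could recover the article from them.

```python
# Toy illustration only: a crude, hash-based "embedding" standing in for the far
# richer representations real LLMs learn. The point is the one-way nature of the
# transformation, not fidelity to any actual system.
import hashlib
import math

def toy_embedding(text: str, dims: int = 8) -> list[float]:
    """Map text to a fixed-length vector of hashed word counts, then normalize it."""
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [round(v / norm, 3) for v in vec]

headline = "City council approves new budget after late-night vote"
print(toy_embedding(headline))  # eight numbers, not something a reader could mistake for the story
```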
Only when you put an LLM into a consumer product such as a chatbot or search engine do you see it potentially infringing on copyright. An LLM, after all, can produce variations on any text. But even then, while those variations very clearly can substitute for the originals on which the model was trained, they are indeed variations — akin to the sort of human rewrites that publishing companies do all the time. (Note that the Times’s recent suit presents evidence of ChatGPT reciting paragraphs of text from Times content — clearly a copyright violation — but this can be easily fixed, just as human rewriters can be trained not to repeat text verbatim from other sources.) Moreover, no one can own a copyright to mere facts. And yet, if one cannot, then how can the rights of content producers be protected?
The answer, I think, lies in the fact that LLMs tend to hallucinate — make up things that aren’t real — and that they are so expensive to train that the models are updated on the order of months, rather than days or minutes. As the Times points out in its suit, generative AI products tend to rely on a process known as “grounding,” in which the statements made by the AI are checked against relevant source documents to ensure that the AI is not making things up. This process is especially critical if a user is asking about a recent event in which the relevant facts did not exist at the time of the LLM’s training. In such cases, the AI can only answer accurately if it retrieves those facts from recent grounding documents. These documents are the essence of the work newspapers do — sourcing and reporting new facts — and the fruits of that labor should reasonably belong to those who perform it.
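For readers curious about what grounding looks like mechanically, here is a minimal sketch under the assumption that it follows the familiar retrieve-then-prompt pattern: fetch recent, trusted reporting relevant to a question and instruct the model to answer only from it. The article store, function names and prompt wording are illustrative assumptions of mine, not any vendor’s actual system.

```python
# Minimal sketch of a grounding step: retrieve recent reporting relevant to a
# question and build a prompt that tells the model to rely on it. Everything
# here (the article store, the scoring, the prompt text) is illustrative.

RECENT_ARTICLES = [
    {"source": "Example Gazette", "text": "The city council approved the 2025 budget late Tuesday night."},
    {"source": "Example Gazette", "text": "A winter storm closed schools across the county on Wednesday."},
]

def retrieve(question: str, articles: list[dict], k: int = 1) -> list[dict]:
    """Rank stored articles by crude word overlap with the question."""
    q_words = set(question.lower().split())
    return sorted(
        articles,
        key=lambda a: len(q_words & set(a["text"].lower().split())),
        reverse=True,
    )[:k]

def grounded_prompt(question: str) -> str:
    """Assemble a prompt that asks the model to answer only from retrieved reporting."""
    context = "\n".join(f"[{a['source']}] {a['text']}" for a in retrieve(question, RECENT_ARTICLES))
    return (
        "Answer using only the reporting below; if the answer is not there, say you do not know.\n"
        f"{context}\n\nQuestion: {question}"
    )

print(grounded_prompt("Did the council pass the budget?"))
```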
The courts might or might not find this distinction between training and grounding compelling. If they don’t, Congress must step in. By legislating copyright protection for content used by AI for grounding purposes, Congress has an opportunity to create a copyright framework that achieves many competing social goals. It would permit continued innovation in artificial intelligence via the training and testing of LLMs; it would require licensing of content that AI applications use to verify their statements or look up new facts; and those licensing payments would financially sustain and incentivize the news media’s most important work — the discovery and verification of new information — rather than forcing the tech industry to make blanket payments for rewrites of what is already long known.
Such legislation would give publishers new opportunities to generate revenue. If LLM training is indeed held to be a fair use but grounding is not, the ability to verify an AI’s output or infuse it with up-to-date facts becomes not merely valuable but potentially differentiating for publishers’ own products. A small, local media company would be able to license its local articles and factual information to generative AI services, but a large media company might choose not to. It might instead offer its subscribers a differentiated AI service of its own, perhaps built on OpenAI or Google APIs but enriched with proprietary information not available to other providers. Such a service might be more timely, comprehensive and relevant to its subscribers than the tech vendors’ own products, and it would let publishers extend their offerings back into categories of information they haven’t effectively competed in since the print era.
If a court decision or congressional legislation were to rewrite the rules as described, what would the new media world look like? First, to take advantage of the new framework, media companies would need to understand that consumer expectations are about to change dramatically.
In the print era, publishers created “articles,” printed them on paper and distributed that paper to their readers. The web changed everything about the distribution and the literal paper, while the articles remained mostly untouched. But in the future, publishers will have to think less about those articles and more about conversations with users. Users will interact less and less with the actual articles, instead talking about them with what the tech industry used to call “intelligent agents.”
Back in the 1990s, Microsoft introduced Clippy — a simpering, eye-batting paper clip who interrupted you at inopportune moments to ask whether you needed help. Microsoft put Clippy out of his misery long ago, but as is so often the case, the technology finally caught up to the idea.
The new breed of LLM-powered Clippy is going to do all the things Microsoft hoped it would in 1996: brief you on the news, your day, your emails; respond for you; answer your questions; help with your work. One morning, it might let you know that “The Washington Post announced it has launched a new AI assistant, called Marty.” As you ask for more info, it says, “Why don’t I just ask him to join us right now since you’re a subscriber.” Marty joins the conversation and gives you a roundup of The Post’s latest coverage, responds to a question you have with a relevant infographic, updates you on some political gossip and recommends a newly reviewed TV series based on your interests. (Because you’re a subscriber, he knows what you like.) “Can you find me a restaurant for Thursday night?” you ask, and Marty gives you some of the best local options and what they’re known for, and he notes that he can offer you a discount at one of them. Maybe you decide to make Marty a part of your daily briefing or, on the other hand, maybe you turn to your ChatGPT agent and ask, “So what do I need you for?” She might say, “I can do things like make travel arrangements,” to which Marty responds, “We have a travel agent we work with, as well. Shall I ask ExpediaBot to join?” Welcome to your new daily newspaper.
The details could turn out very differently, of course. It depends on the outcome of the current copyright disputes and on the ability of publishers to envision a future that looks very different from their past. But one thing is certain: As with the web 30 years ago, those details will determine whether the news business reclaims its status as the premier vendor of reliable information or falls into a final, unrecoverable decline.