The New York Times has joined the AI Wars by suing OpenAI and Microsoft (complaint here). The case was filed while I was on my yearly holiday, so I am late to the party. Quite a lot has already been written on the subject by various parties. Depending on whom you listen to, this lawsuit could either completely destroy generative AI or represent the last act of an industry consigned to the past. As usual, I believe the truth may lie somewhere in between.
The case
The lawsuit comes after months of negotiation between the NYT and the tech companies, and follows agreements that OpenAI reached with Axel Springer and with the Associated Press. OpenAI apparently did not offer enough, and its deals with other organisations signalled the commercial importance of news, so the NYT felt that it had to sue for copyright infringement.
The basis for the lawsuit is now becoming familiar: both Microsoft and OpenAI have admitted to using Common Crawl in their training data. Common Crawl includes sources such as Wikipedia and, most importantly, the New York Times. A 2019 snapshot of data included in the complaint calculates that NYT content amounted to 100 million tokens. These were used in training GPT-3, whose training set is said to comprise 13 trillion tokens, so the New York Times accounts for well under 0.1% of the whole. However, the Times rightly claims that some content is given a higher weight because of its quality and importance, so the percentage is deceptively low.
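To get a feel for the scale, the share can be worked out directly. This is only a back-of-the-envelope sketch: it takes the two token counts quoted above at face value, treats every token as equally weighted (which the Times disputes), and the names `NYT_TOKENS`, `TOTAL_TOKENS` and `share_percent` are made up for illustration.

```python
# Figures as quoted in the post; treated here as assumptions, not verified data.
NYT_TOKENS = 100_000_000            # 100 million tokens attributed to NYT content
TOTAL_TOKENS = 13_000_000_000_000   # 13 trillion tokens in the full training set


def share_percent(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`."""
    return part / whole * 100


# Unweighted share: every token counts the same.
nyt_share = share_percent(NYT_TOKENS, TOTAL_TOKENS)
print(f"NYT share of training tokens: {nyt_share:.5f}%")
```

On these numbers the unweighted share comes out at roughly 0.0008%, a small fraction of a tenth of a percent. Any extra weight given to high-quality sources during training would push the effective figure up, which is the point the Times is making.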
It is therefore beyond doubt that NYT content was included in the training of the GPT models behind ChatGPT and Microsoft’s Copilot, and the case will rest on the legal question of whether such copying infringes copyright or is fair use. So far this is quite similar to the other ongoing copyright cases, but there is a big difference. One of the biggest weaknesses of some of the previous lawsuits is precisely that they have centred on the input question, namely the copies that are used in training a model. The New York Times case is the strongest so far because it was able to show potentially infringing outputs. Included in the lawsuit are several examples in which ChatGPT was able to replicate paragraphs from NYT articles almost verbatim, which could well amount to the substantial reproduction needed to prove copyright infringement. This is a massive development, and it sets the case apart from the others.
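How close “almost verbatim” is can be made concrete by measuring the longest run of consecutive words that an output shares with a source text. The snippet below is a rough sketch of that idea using Python’s standard `difflib` module. It is not the method used in the complaint, the two texts are invented for illustration, and the helper name `longest_shared_run` is hypothetical.

```python
from difflib import SequenceMatcher


def longest_shared_run(source: str, output: str) -> list[str]:
    """Return the longest run of consecutive words found in both texts."""
    src_words = source.lower().split()
    out_words = output.lower().split()
    # Compare word lists rather than characters; disable the junk heuristic.
    matcher = SequenceMatcher(None, src_words, out_words, autojunk=False)
    match = matcher.find_longest_match(0, len(src_words), 0, len(out_words))
    return src_words[match.a : match.a + match.size]


# Hypothetical texts, invented for illustration only.
article = "the council voted on tuesday to close the old bridge for repairs"
model_output = "Reports say the council voted on Tuesday to close the bridge"

run = longest_shared_run(article, model_output)
shared_fraction = len(run) / len(article.split())
print(" ".join(run), f"({shared_fraction:.0%} of the article)")
```

A long shared run covering much of the source points towards reproduction, while short scattered overlaps are what one would expect from any two texts on the same topic. Whether a given overlap counts as “substantial” remains a legal question, not something a script can settle.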
We have known for a while that generative AI models can sometimes reproduce some of their training data. This is sometimes referred to as memorization, but what is happening is better described as over-fitting. While trained models do not keep copies of their training data, sometimes they have seen a work so often that they can reproduce it substantially. This can be seen in images of famous art such as the Mona Lisa, but it also takes place with popular characters, and even memes. In text it has also been seen with famous opening sentences, such as those of The Lord of the Rings or Harry Potter. What the Times managed here was to find memorization of some of its own articles, 100 examples in total.
This was one of the most eye-catching elements of the lawsuit, and it immediately caught the attention of commentators and experts. Having actual outputs would make a copyright infringement case much stronger, and some even saw it as the smoking gun: the Gen AI killer had arrived. I found these examples quite impressive, but I am now less inclined to believe that they will play an important part if the case ever leads to a decision. There are two reasons for this growing scepticism on my part. The first is that not all articles have been memorized, only very popular ones that have been replicated elsewhere, so the source of the memorization may not be the New York Times itself. The second is that some people have noticed that the prompts used to produce the allegedly infringing text may have included links to the actual articles, which would make the results witnessed more likely and would reduce the importance given to these reproductions. It has also been pointed out that ChatGPT is more likely to hallucinate the content of an article than to reproduce it.
The end result is that this case is likely to be decided separately based on the inputs and the outputs. It’s possible that the inputs may be deemed fair use (following precedents like Google Books), while the outputs could be considered infringing. Alternatively, the inputs could be found infringing, but the outputs might be seen as non-infringing, especially if they are mostly the result of very specific prompts that the public cannot use. It’s also possible that everything could be ruled as infringing, or all as fair use. Anyone claiming to know the exact outcome is likely overconfident.
An aspect of the complaint that I found particularly intriguing, and which could potentially challenge fair use, concerns the ability to use chatbots to bypass paywalls. The complaint cites examples where ChatGPT has been used to produce summaries of articles behind a paywall. While summaries themselves do not infringe copyright, their use could demonstrate a commercial impact on the New York Times, which could influence the fair use analysis.
Wider implications
This case cannot be seen in isolation. It is not only the strongest case so far against generative AI companies, stronger even than the Getty lawsuits, but it should also be considered in the context of the ongoing battle between traditional media and technology, a struggle that has persisted for two decades. This battle is unlikely to be resolved with this technology, and it will continue to be fought on the fringes of copyright law.
The first thing to note is that there is a good chance this lawsuit is a negotiating tactic and will be settled out of court, rather than leading to an actual decision. Even if it reaches a jury, the verdict will likely be appealed repeatedly, potentially becoming a protracted battle spanning years. And by the time a decision is reached, what will the state of AI be? Time therefore seems to favour the tech companies.
Another aspect to consider is the relatively weak negotiating position of media companies. Despite my support as a paid subscriber to several newspapers, it’s undeniable that the newspaper industry is gradually declining. The internet has posed significant challenges to news media, and some might even welcome the downfall of mainstream media.
A significant portion of the complaint argues for the necessity of newspapers in a democratic society, a viewpoint I wholeheartedly support. However, the economic reality is that traditional media needs more income sources, leading them to approach tech companies from a disadvantageous position.
Additionally, as I have stated repeatedly, training an AI is not an exclusive right of the author. Trainers could legally purchase a copy of a work, use it to train an AI, and potentially not infringe on copyright, provided the trained model is not considered a derivative of the inputs, a theory that I believe lacks legal basis.
This case may bring to the forefront a legal argument already evident in the Getty Images suit in England: the non-infringing uses theory. This draws from the Sony doctrine in the US and the Amstrad case in the UK, which suggest that if a technology has substantial non-infringing uses, its maker cannot be held liable for secondary infringement, even if the technology can also be used to infringe (think DVD recorders). This could be a compelling argument, especially against secondary liability claims.
Ultimately, the outcomes may also hinge on the perceived utility and prevalence of the technology. The New York Times seeks to destroy all models trained with its content, but I doubt it will come to that, though I could be wrong. As someone noted on Twitter (apologies for not having the reference), judges’ decisions are influenced by personal experience, suggesting that the widespread use of ChatGPT by judges and their families might reduce the likelihood of the technology being completely abolished.
Furthermore, even if US cases are resolved unfavourably for generative AI companies, the global landscape is vast. It’s possible that other countries might seize the opportunity to become more accommodating to AI developments.
Concluding
This is a fascinating case, not only for the legal issues involved, but also for its role in the future of AI. We don’t know what will happen, but what is clear is that this could prompt many other media companies to initiate proceedings against tech companies. I have predicted 4-5 years of litigation until things settle down, and I think we’re on track for at least that prediction to hold true.
Eventually both industries may benefit from licensing schemes, but we will have to see if such an approach prevails. I do believe strongly that generative AI is here to stay, and the quicker we solve the outstanding issues, the better.
I asked ChatGPT for a closing joke for this blog post, and it gave me this:
“Media companies and tech companies in court is like watching two keyboards argue, one stuck on ‘print’ and the other on ‘delete’!”
That’s not bad, actually.
Comments
Anonymous · January 5, 2024 at 10:37 am
Hi Andrew,
Thanks for sharing your view.
I don’t believe that this is correct:
“The second is that some people have noticed that the prompts used to produce the allegedly infringing text included links”
In Exhibit J, the URLs are included for the sake of the reader; it’s clear that they are not part of the actual prompts.
I also wanted to ask you, don’t you think it makes a difference to the fair use defense that these millions of NYT articles were placed behind a metered paywall? It was not (or at least not supposed to be) public internet data, but proprietary data.
Cheers
Anonymous · January 5, 2024 at 10:38 am
*Andres
Andres Guadamuz · January 5, 2024 at 11:44 am
Thanks, the prompt is not included, as Mike points out. If the link was fed, that changes things.
Anonymous · January 5, 2024 at 2:29 pm
Hello from France, very interesting POV on this case. I agree with 98% of your analysis. I would be very surprised if an actual trial ever takes place and even more if an anti-tech definitive decision ensues.
“Media companies and tech companies in court is like watching two keyboards argue, one stuck on ‘print’ and the other on ‘delete’!”
That ChatGPT generation might be funny (it didn’t make me laugh though, nor smile), all right, but it is not relevant: while the media might say ‘print’, what’s the logic of tech saying ‘delete’? In fact, both words should come from the media. The media say ‘we print’, and to the tech they say ‘delete’.
Gen-AI doesn’t reason.
Andres Guadamuz · January 7, 2024 at 12:00 am
I took it that the ones saying “delete” are the media, while those saying “print” are the tech companies, but I see your point.