A few weeks ago, The Atlantic published an article titled “These 183,000 Books Are Fueling the Biggest Fight in Publishing and Tech”. The article discussed a dataset used to train some large language models, which has sparked controversy due to the inclusion of a significant number of suspected pirated book copies. Accompanying the article is a search tool that enables authors to check if their books are part of the dataset, a feature that has elicited angry responses from many authors whose works were included without their consent. The dataset has since triggered several lawsuits, including a highly publicised one by comedian Sarah Silverman against Meta and OpenAI, another by the Authors Guild against OpenAI, and most recently Mike Huckabee v Meta.

None of my books were included in this particular dataset, and personally I would not mind if they had been used in training, but many other people object to this and have complained strenuously on social media. So what should authors do? And how does this development fit with the ongoing legal battles regarding copyright and generative AI?

Books3

What is this dataset that contains pirated books, and why is it being used to train AI? The now infamous dataset is called Books3, and it was created by EleutherAI, a non-profit open source research group whose stated purpose is to break the domination of large tech companies in machine learning research by providing tools that anyone can download. One such tool is The Pile, a large open dataset made up of 22 smaller datasets, with content such as web crawls (Common Crawl and OpenWebText), PubMed, ArXiv articles, Wikipedia, the USPTO, Project Gutenberg, and Books3, which has a weighting of 12% in the entire dataset. Books3 is a dataset of fiction and non-fiction books collected from a torrent tracker called Bibliotik.

So it is uncontested that Books3 contains infringing copies of a multitude of works, but what does this have to do with companies like Meta and OpenAI? Well, we know for sure that two book sources contained in The Pile were used in the training of Meta’s own LLaMa large language model, namely the Gutenberg dataset (public domain works) and Books3. This is stated in the LLaMa paper, and those two sources make up 4.5% of the total training data. We don’t know for sure whether Books3 was used in any of the models from other companies such as OpenAI and Google, but it has long been suspected that a dataset used by OpenAI in the training of GPT-3, called Books2, may also contain infringing copies of books.
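To put those figures in perspective, here is a rough back-of-the-envelope calculation (in Python for convenience). The 4.5% figure is the one stated in the LLaMa paper, and the 183,000 titles are the count reported by The Atlantic; treating every book as an equal share is a simplifying assumption, not least because the 4.5% also includes the Gutenberg corpus.

```python
# Rough, illustrative arithmetic only: what share of LLaMa's training data
# might a single Books3 title represent?

books_share_of_training = 0.045   # Gutenberg + Books3 share stated in the LLaMa paper
number_of_books = 183_000         # approximate Books3 count reported by The Atlantic

# Simplifying assumption: every book weighted equally, Gutenberg ignored.
per_book_share = books_share_of_training / number_of_books

print(f"Approximate share of one book: {per_book_share:.8%}")
# roughly 0.00002% of the training data per title
```

This sort of arithmetic is, I suspect, exactly what defendants will put in front of a court when arguing about how little any single work contributed to a model.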

I need to pause here to make a statement about the inclusion of these datasets in the training of large models by tech companies. I am strongly in favour of opt-outs for training models, and the wishes of authors, artists, and other creators should be respected. I am also on record as stating that in many circumstances, I believe that training should normally fall under fair use/fair dealing, with opt-out caveats and other possible exceptions. However, I cannot believe that tech companies were so arrogant as to think for a single second that including a dataset containing pirated books would result in anything other than widespread litigation against them. Even if they strongly believed that what they were doing constituted fair use, and there are reasons to think so, opening themselves to liability in this manner seems reckless at best, and foolhardy at worst. Especially because I believe that it would be relatively inexpensive to stay on the right side of the law, as I will discuss later.

Is legal action viable?

Many authors whose books were contained in the dataset may be considering legal action for copyright infringement, and therefore should be talking to their lawyers instead of reading this blog post. But some people may not want to, or may be unsure as to what to do next.

There are quite a few considerations when analysing this situation. On the face of it, it would appear to be a very straightforward case of copyright infringement: copies of books have been made without the permission of the authors, and those copies have been used to train large language models. Case closed, wait for the money to roll in… Except that I don’t think it’s completely straightforward, and these cases may be litigated for years to come.

The first thing to consider is who made the copies in the first place. EleutherAI appears to be the likeliest infringing party in this case, as they knowingly used a dataset containing infringing book copies. But EleutherAI is not a very good target for litigation: it is a loose group of machine learning enthusiasts that started life as a Discord chat group and grew into a self-described research organisation. Even if they were found to be infringing, the chance of obtaining any monetary recompense would be minimal. At best, one could expect an injunction ordering a stop to any distribution of the infringing dataset. No doubt it would still be made available through other illegal means, but at least the official distribution would stop.

The second party involved, and one presenting a juicier target, is Meta itself. We know for sure that they used Books3 in some of their training, so they have infringed copyright, correct? Here, in my opinion, things become a bit trickier. I don’t think for a second that the in-house counsel for these companies would have given a green light to the use of Books3; even if they thought that this was fair use, that would have been a big gamble. So to my mind there are two options: the researchers didn’t consult their lawyers because they assumed that Books3 was fair use, or there is a small possibility that they were not aware that Books3 consisted of infringing copies of works. I have no idea, but discovery should be fun in these lawsuits.

However, even if the copying performed by EleutherAI may be a straightforward case of copyright infringement, I think that subsequent uses by other companies may prove to be less actionable. The exclusive rights of the author vary from jurisdiction to jurisdiction in some detail or another, but in most places copyright holders get the rights to reproduce, make derivatives (adapt), lend, publish, display, perform, and communicate the work to the public, amongst others. This means that in order to perform any of those actions, you need permission from the owner.

This is where I think the first line of defence will take place, and it has already been a part of some of the early motions to dismiss, although we have only gotten glimpses of it. The argument goes something like this: the books were not copied by Meta; they took a publicly available dataset and used it to train roughly 2.5% of the model internally. This means that the books have not been published or made available to the public, and the resulting models are not derivatives of the books. Under this argument, defendants will claim that what they have done amounts to fair use because the resulting models do not contain copies of the works. Furthermore, they’re likely to argue that the resulting models are not in commercial conflict with any books in the dataset; you wouldn’t use LLaMa or ChatGPT to read the entirety of A Song of Ice and Fire. Meta uses this argument in its motion to dismiss in the Silverman case, when discussing the derivative issue:

“The fact/expression dichotomy was further elucidated in Authors Guild, in which the Second Circuit rejected an argument that the Google Books project—for which Google made digital copies of millions of books without permission to create a tool allowing Internet users to search for certain words or terms within them—constituted an infringing derivative work. The court reasoned that plaintiffs had no “supposed derivative right to supply information about their books,” such as “word frequencies, syntactic patterns, and thematic markers.” This “statistical information,” the court found, does not constitute “copyrighted expression,” and its use by Google did “not support Plaintiffs’ derivative works argument.””

Will these arguments fly? I have no idea; we’re about to find out, and I think it could go either way. I really think that the fact that the books are pirate copies will look very bad, particularly in jury trials, but I also find the small share that each individual book contributes to the overall training of the model to be a persuasive argument towards fair use. I wouldn’t want to call this one.

If you’re an author whose book is included in the dataset and you’ve come across this information, you might be contemplating legal action for copyright infringement. If you’re considering this, it might be best to join one of the ongoing class action lawsuits; the one by the Authors Guild appears especially strong. The rationale for collective action having a higher likelihood of success is that each individual book has negligible impact on the trained model, so a collective effort by multiple authors could present a more compelling argument.

Outputs and derivatives

What I have described so far relates to the input phase, that is, the training of the model. It is clear that in some instances there has been copying, and the lawsuits will try to determine whether that is fair use/fair dealing or copyright infringement. But there is something missing from most of the cases dealing with generative AI, and that is evidence of an infringing output. If you ask a language model to reproduce the contents of a book, it won’t do it; at best it will be able to give a summary of the book, but not verbatim passages. While some research has managed to reproduce some passages from extremely popular and well-cited books, these are limited cases. Needless to say, book summaries are not copyright infringement.

The reason for this is that trained models do not keep copies of the works used as inputs. LLMs aren’t search engines: training a model extracts tokens from the works, and a language model is, at its very core, akin to a highly advanced digital wordsmith. It learns the art of human language by analysing vast amounts of text, which enables it to generate coherent content.
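As a very rough illustration of what “extracting tokens” looks like, here is a minimal Python sketch. It uses a toy whitespace tokenizer and an invented sentence rather than the subword tokenizers and massive corpora real LLMs rely on; the point is only that what training distils from a text are statistics over tokens, folded into model weights, rather than a stored copy of the passage.

```python
from collections import Counter

# Invented sentence standing in for training text.
text = "the dragon flew over the castle and the knight watched the dragon"

# Real models use subword tokenizers (BPE and similar); whitespace
# splitting keeps this sketch minimal.
tokens = text.lower().split()

# Training ultimately keeps statistics like these, not the passage itself.
token_counts = Counter(tokens)
print(token_counts.most_common(3))
# [('the', 4), ('dragon', 2), ('flew', 1)]
```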

This is where most of the existing lawsuits try to allege that all of the outputs produced by an LLM are actually derivatives of all of the inputs, a position that I do not think holds water and that would require a complete twisting of copyright law. Imagine that you ask the LLM to check the grammar of an email that you wrote. The contention appears to be that the email it writes is somehow a derivative of all of the books in Books3, which would also mean that the email is a derivative of every other text in the input, including web-scraped material such as this blog. The argument is that all outputs are derived from every input, even if there is practically nothing of the original in the model. That to me makes no sense.

The insistence on making the derivative argument is likely because the ‘fair use’ input contention remains unresolved. However, I also reckon that individual authors might struggle to secure substantial damages based solely on the inputs. This difficulty arises from the nature of the training process. As previously noted, the books are neither distributed nor sold; any potential infringement occurs internally. Moreover, the nature of this infringement could prove fascinating. Datasets are handled as bulk text, not on an individual basis; the copy exists within the dataset, and it’s the language data that’s extracted — not the meaning, but the probability of one token following another. These copies can subsequently be discarded. Fundamentally, the only exclusive right of the author breached is that of copying; extracting information from a text does not constitute an infringing act.
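To make the point about “the probability of one token following another” concrete, here is a minimal sketch of that idea as a toy bigram model. It is a drastic simplification of how LLMs are actually trained, but it shows the feature that matters for the legal argument: once the statistics have been collected, the copy of the source text can be discarded.

```python
from collections import defaultdict

# Toy corpus standing in for a book in the dataset.
corpus = "winter is coming and winter is long".split()

# Count which token follows which (a bigram model).
follows = defaultdict(lambda: defaultdict(int))
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

# Turn the counts into probabilities of the next token.
next_token_probs = {
    word: {nxt: count / sum(nxts.values()) for nxt, count in nxts.items()}
    for word, nxts in follows.items()
}

del corpus  # the source text is no longer needed once the statistics exist

print(next_token_probs["is"])
# {'coming': 0.5, 'long': 0.5}
```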

This is again why I think it is a bad strategy for tech companies to use datasets containing books without permission. I think it would actually be relatively cheap for them to remunerate authors, because all you need is to purchase one copy of each book. Training a model is not an exclusive right of the author, and neither is extracting information from a work, so purchasing a copy would be enough. In some ways, training is indeed like someone buying a book and learning from it; authors do not have the exclusive right to stop people from reading their books.

My guess is that we’re heading towards some form of compensation and opt-out scheme. We may get quite a few settlements out of court, and we may even get a few decisions, but I think that what makes more sense at the moment is for trainers to pay some sort of licence fee, or to make agreements with large publishers to gain access to large amounts of text. After all, you only need to do it once per model training.

Concluding

Nobody knows what the next few years will bring. Whenever I am asked to make an educated guess, I say that nobody knows what will happen. However, this period reminds me a bit of other eras of technological advancement: people forget that the birth of the internet was met with a large number of cases against service providers, then we had the P2P wars, and then the intermediary liability litigation. Each big era of modern technology has gone through something similar, and my guess is that at some point things will quiet down. But in the meantime, we will probably see an exploration of some very interesting legal issues.

Authors who have had their works used in training without their consent may join some of the ongoing cases, and it will be interesting to see whether we witness more lawsuits. I haven’t even discussed the issue of jurisdiction; it is possible that authors in the UK and Europe may also join in.

Once more I am reminded that “may you live in interesting times” is a curse. And what interesting times we have.


