The copyright wars are back, and this time the conflict is all about artificial intelligence. While most of the public has been paying attention to what is going on with AI art tools such as DALL-E, the current stage of AI development started with text tools, particularly GPT-3 and the code-writing marvel that is GitHub’s Copilot. I have written about some of the copyright implications of both tools before (here and here), but in case you don’t want to go through two blog posts: Copilot is an AI tool that writes code based on prompts. It is powered by OpenAI’s Codex, a model trained on the corpus of code uploaded to the open source software repository GitHub.
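
To give a flavour of what that looks like in practice, here is an illustrative sketch in Python. This is my own invention, not genuine Copilot output: the developer types a signature and docstring, and a tool like Copilot proposes a plausible body.

```python
import re

# What the developer types:
def is_valid_email(address: str) -> bool:
    """Return True if the address looks like a valid email."""
    # What a Copilot-style tool might then suggest as a completion:
    pattern = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"
    return re.match(pattern, address) is not None
```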

Almost from the start, Copilot has proven controversial: some people complained that it was a violation of open source principles (and potentially infringed copyright). Yet it appears to be widely used by developers; according to GitHub, the tool was used by 1.2 million users over a period of 12 months.

Infringement in outputs

Over time, the accusations of possible copyright infringement from developers have continued. In a recent Twitter thread, computer science professor Tim Davis found that Copilot was suggesting some of his own code back at him.

While the output was not an exact copy, it was close enough to warrant infringement claims. It is impossible to tell for sure from a few screenshots, but the code does appear similar, perhaps substantially so. However, Davis ruled out taking legal action.

So is Copilot infringing copyright in this and other cases?

While the evidence seems strong in the case presented above, the code is not exactly the same, and some people commenting on the thread (and Davis himself) hypothesised that the code may have originated from a third party who uploaded it to GitHub with modifications but no attribution to Davis. GitHub is aware that replication of code does happen from time to time, but it argues that this is very rare:

The vast majority of the code that GitHub Copilot suggests has never been seen before. Our latest internal research shows that about 1% of the time, a suggestion may contain some code snippets longer than ~150 characters that matches the training set. Previous research showed that many of these cases happen when GitHub Copilot is unable to glean sufficient context from the code you are writing, or when there is a common, perhaps even universal, solution to the problem.

Some occasional replication is to be expected in the outputs, particularly with code that is popular or that represents a universal solution to a specific problem. In my opinion it will depend on the specifics, but at least from the few examples of replication that I have seen, I don’t think infringement litigation would be successful. It’s still early days, though.
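
GitHub’s ~150-character figure suggests one concrete way to think about replication: measure the longest verbatim run that a suggestion shares with the training set. Here is a minimal sketch of that idea in Python. It is my own illustration, not GitHub’s actual filter (which is not public), and a production system would use indexing or hashing rather than this quadratic comparison.

```python
def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest verbatim substring shared by a and b (dynamic programming)."""
    best = 0
    prev = [0] * (len(b) + 1)  # prev[j]: common suffix length of a[:i-1] and b[:j]
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def may_replicate(suggestion: str, corpus_files: list[str], threshold: int = 150) -> bool:
    """Flag a suggestion sharing a verbatim run of >= threshold characters with any training file."""
    return any(longest_common_substring(suggestion, f) >= threshold for f in corpus_files)
```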

Infringement in inputs

While it may be difficult to find infringement in outputs, the question of inputs is where things are really starting to heat up. The most interesting legal debate concerns the data used to train machine learning models. This has been a large part of the ongoing debate around art models (discussed here), but the first shot in the coming litigation may very well involve Copilot.

Programmer and lawyer Matthew Butterick got a lot of attention when he announced that he was starting an investigation into Copilot with the intention of eventually bringing a class-action lawsuit against GitHub and its parent company Microsoft. I will not go through his arguments in detail, but in my opinion they boil down to the following points:

  • Copilot is trained on code uploaded by Github users.
  • This code is under open source licences that have several restrictions, such as copyleft clauses and attribution requirements.
  • These requirements are not being met; therefore Copilot is breaching the licence terms of a large number of uploaded projects, which means that its use of that code is infringing.
  • If they are infringing, then they must rely on fair use.
  • There is no fair use defence for training data for machine learning.
  • Therefore Microsoft is infringing copyright.

There is also a very strong ethical element to the complaint. Open source software communities are there to share code, but Copilot takes that code and closes it in a walled garden that contributes nothing to the community.

This is probably the biggest potential challenge to AI that we have witnessed yet, and its reach should not be underestimated. I have been getting a few questions about this: is Butterick right?

[A big caveat before I begin: Butterick is a US lawyer, and his analysis is based specifically on US law. I’m not a US lawyer, and while I’m familiar with some of the case law, please take my opinions with a pinch of salt; as always, I’m open to corrections on the law.]

Butterick’s analysis rests on two assumptions: first, that using open source code to train a machine learning model will trigger the terms of open source licences; and second, that there is no fair use in machine learning.

I’m not entirely convinced by the first assumption. There’s no doubt that the code stored on GitHub is released under open source licences; these range from academic licences such as MIT to copyleft licences such as the GPL. The common requirement across these licences is that people can use the code to produce derivatives as long as the terms of the licence are met (attribution, share-alike, etc.). The legal question will rest entirely on whether using that code as training data triggers the terms of those licences, and this is likely to be an argument that will have to look at the inner workings of OpenAI’s Codex. Is Codex copying the code? If so, is it generating a derivative of that code? Does reading and learning from that code count as “use” in the terms of a licence, and is the resulting code a derivative of the original program?

I don’t know enough about the inner workings of Codex and Copilot to answer these questions, but I wouldn’t assume that they are easy to answer. I am reminded of the landmark case of Google v Oracle, in which Oracle claimed that Google’s use of its Java API in earlier versions of Android infringed copyright. While the lower courts had found that APIs could attract copyright, the Supreme Court held that Google’s use was fair, a decision that rested in some part on the technical details of how Google’s code interacted with the API. I can envision a similar argument ensuing here, in which the technical details of what happens when code is used as training data will be examined. It could go either way, but if I were a betting man I would put my money on it not being a derivative, at least based on my limited knowledge of machine learning models.
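
Part of why the “copying” question is hard is what training actually retains. As a drastically simplified illustration (a toy bigram model of my own, nothing like Codex’s actual architecture): the training code is read, reduced to statistics, and discarded; what survives is a table of numbers rather than the code itself.

```python
from collections import defaultdict

def train_bigram_model(files: list[str]) -> dict:
    """Toy 'training': count which token follows which. The files themselves are not kept."""
    counts: dict = defaultdict(lambda: defaultdict(int))
    for source in files:
        tokens = source.split()
        for prev_tok, next_tok in zip(tokens, tokens[1:]):
            counts[prev_tok][next_tok] += 1
    return counts  # statistics derived from the code, not a copy of it

def suggest_next(model: dict, prev_tok: str) -> str | None:
    """Generate by picking the most likely next token seen during training."""
    followers = model.get(prev_tok)
    if not followers:
        return None
    return max(followers, key=followers.get)
```

Note that even this toy can regurgitate: if a token sequence appeared only once in the training files, greedy generation can reproduce it verbatim, which mirrors the replication issue discussed above.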

However, let’s give this first point to Butterick and assume that training on the code does indeed produce a derivative for the purposes of open source licences. What next? Copilot would then have to rely on a fair use defence, arguing that using GitHub’s code to train a machine learning model is in itself fair use.

Here I agree in principle that there is no direct case law dealing with fair use in training an AI. However, there is a good argument to be made that the use of training data is fair use, drawing on Authors Guild v Google and the aforementioned Google v Oracle. It is true that this question is undecided, and as with the first assumption, a court case could easily go in favour of those claiming copyright infringement, but I don’t think it’s a slam-dunk argument by any stretch of the imagination.

Speculation time

This is starting to look like the very first case dealing specifically with machine learning and fair use in the US. I have been expecting something like this to happen, and I am surprised that it has not taken place yet. Perhaps some copyright owners have been reluctant to test the assumption that training a machine learning model with copyright works is fair use, as a negative decision (or a positive one, if you’re on the side of AI developers) would be a devastating blow to copyright owners. As things stand right now, there is still reasonable doubt regarding the legal status of copyright infringement in datasets, and we have seen that some companies have been reluctant to jump on the AI bandwagon precisely because of fears of copyright infringement. An outright declaration that the use of training data is fair use would finally put those fears to rest.

This case, if it goes ahead, could be the very first to test that theory. It could very well be successful, but I wouldn’t bet on any result. One thing is clear: if this case goes ahead it will take years. Any lower court decision will be appealed, and the appeals could make it all the way to the US Supreme Court. So we’re talking years and years of uncertainty.

However, one thing is certain: other countries have already enacted legislation that makes training machine learning models legal. The UK has had a text and data mining exception to copyright for non-commercial research purposes since 2014, and in 2019 the EU passed the Digital Single Market Directive, which contains an exception for text and data mining for all purposes, as long as the rightsholder has not reserved their rights.

The practical result of these provisions is that while litigation is ongoing in the US, most data mining and training operations will move to Europe, and US AI companies could simply license the trained models. Sure, this could potentially be challenged in court in the US, but I think it would be difficult to enforce, particularly because the training took place in jurisdictions where it is perfectly legal. The result would be to place the US at a disadvantage in the AI arms race.

Concluding

Litigation will happen. Be it against Copilot or against one of the AI art tools, it is evident that at some point someone will test the assumption that there is fair use in processing data for training purposes. I have no idea how things will go; honestly, it could go either way. But one thing is clear: unless there is a change to the law in the next decade, Europe will become the AI training data haven.

Stay tuned.

Update: the lawsuit has now dropped; you can find the complaint here. There are a few surprises, and I will write a blog post about it soon.


5 Comments


Andy J · October 20, 2022 at 2:37 pm

I wonder how long it will be before we see ‘lobbyists’ employed by Microsoft actively trying to get the US Congress to change the law in line with the UK and EU. While that may not prevent them being sued over the Copilot project, it may prevent future suits of the same sort, assuming that Butterick gets anywhere with his proposed claim.

But given that AI coding is the future, we (meaning the legislators) need to address the fundamental issue, namely that a computer program or an app or even an API is not a novel or blockbuster work of fiction and so the former should not be entitled to the same rights and protection as the latter when it comes to copyright. This will be even more the case when the human element involved in coding is decreased to the point of just composing a wish list of outcomes and letting the AI sort out the implementation.

I don’t deny that a company which makes a substantial investment in AI to produce software should be allowed some protection on its investment; what is needed is a new sui generis right which borrows from the sort of protection provided to patents, the EU concept of design rights (roughly analogous to the US design patent) and the EU database right. A regime which provides for a limited-duration monopoly over the intellectual property, which would allow a successful piece of software to recoup the company’s investment, but also one which releases the intellectual property into the public domain at the earliest possible moment to encourage innovation. When such a law is drafted, a good deal of time and care should be devoted to working out what exactly constitutes copying, so as to rule out certain circumstances such as, in Andres’s words, ‘a universal solution to a specific problem’. Whether or not the sui generis right would need its own form of fair use exceptions really depends on how the legislators frame the basic law in the first place. One other thing which must also be clear is the need to make such legislation as future-proof as possible, taking into account the rate of innovation which is likely in AI in the coming years.

andrewducker · October 21, 2022 at 10:13 am

I think it depends a lot on how the training data is used.
Let’s say that you read a lot of code to train yourself, and then wrote some code to solve a problem. If you remembered the original code so well that you effectively transcribed it from memory, would that be fair use? I assume it would be if you simply wrote new code that had clearly been informed by what you read, but didn’t simply repeat it.
And it seems like the same should be true of training an AI. If it just spits out what went in, then that’s not “training”, that’s just a big collection of copyrighted code. If it writes its own new code, based on what it learned from the old code, then they can justify that as fair use.
Where the line is, of course, is a matter for lawyers.

Why generative AI legal battles are brewing | The AI Beat – News China 365 · October 23, 2022 at 12:27 pm

[…] said as much in a blog post this week – […]

Why generative AI legal battles are brewing | The AI Beat – VentureBeat – Finahost Online Solutions · November 21, 2022 at 2:31 am

[…] law at the University of Sussex in the UK who has been studying legal issues around generative AI, said as much in a blog post this week – though he cautioned the legal battles could drag on for years. The GitHub Copilot case, he […]
