While we’re in the middle of a Golden Age for the discussion of copyright and artificial intelligence, one topic that may not have received as much attention as it warrants is the interaction between open content and AI, and in particular content released under some-rights-reserved licences such as Creative Commons.

AI requires vast amounts of data to train a model, and while the legality of using copyright works for training is under scrutiny in the courts, some models are being trained using open content, namely text, pictures, and music under CC licences, and, in the case of software, open source code. An example of such content is precisely this blog, which is released under a Creative Commons Attribution-NonCommercial-ShareAlike licence. The idea here is that AI developers can use such content lawfully, as the terms of the licence would normally allow the use of these works for training. The legal question, however, is whether training an AI model with CC-licensed content is indeed in compliance with the terms of the licence, or whether it could fall foul of some of them. I think that CC licences are fully compatible with AI training, and in fact allow it to take place without asking permission from the licensor. I’ll explain my reasoning later.

Creative Commons itself has been talking about this relationship in articles, and more recently at the annual CC Summit. An article by CC’s General Counsel Kat Walsh discusses the legal and ethical uncertainties surrounding the use of copyrighted inputs for AI training and the implications for the open commons. CC licences can grant permission to reuse copyrighted works, but they don’t override existing copyright exceptions. This is particularly relevant for AI, as using copyrighted works for AI training could be protected under fair use in the US or data mining exceptions in the EU, depending on the case. Furthermore, the article explores how CC licences apply to works created using generative AI. While recognising that rapidly evolving AI technologies pose legal challenges, it states that creators can still apply CC licences to their works created with AI tools. However, there are ethical concerns beyond copyright, such as privacy, consent, and bias, which require solutions that go beyond legal frameworks. The article concludes that despite these challenges, generative AI offers great potential for creativity and sharing, but concerns like author recognition, fair compensation, and the impact of AI-generated works on the commons must be addressed so that creators are not deterred from contributing to the commons.

Then there’s an article that details findings from the annual CC Summit in Mexico. The article discusses the challenges and opportunities presented by generative AI for creators and the global commons. It emerged from discussions at the CC Global Summit, identifying key issues and formulating principles to navigate the complex landscape of generative AI. Key considerations include the varied legal status of using copyrighted works for AI training globally, the capital-intensive nature of AI development, and concerns over power imbalances and colonialist patterns in AI’s development and use. It emphasizes the need for principles addressing the position of creators, machine learning system builders, and the commons in this context. Seven principles were proposed to regulate generative AI models. These include ensuring continued access to existing works for study and creation, defining ways for creators and rightsholders to express their preferences on AI training, addressing broader rights and interests beyond copyright, special attention to traditional knowledge in AI training, legal allowances for using copyrighted works for noncommercial public interest purposes, ensuring broad economic benefits from AI, and balancing resource concentration with public investment in computational infrastructures and training datasets respecting these principles.

I used to be quite involved with CC, but I haven’t worked with them in recent years other than attending a couple of meetings here and there, so I haven’t been engaged in these discussions internally. Still, I tend to agree with most of the discussions and current outputs. I strongly believe that AI training meets not only the general ideals contained in CC licences, but also the spirit of the movement. Creative Commons exists as a legal hack to ensure the widest possible sharing and reuse of works for a wide variety of purposes, and that often includes commercial re-uses, even though some people think that this is incompatible with the licences.

So what about the legalities? Why are some people doubting that AI training is compatible with CC? I think that this arises both from a misunderstanding of the licences and from not considering what exactly happens, from a technical perspective, when a model is trained. Let me elaborate.

CC licences come in six flavours that use a combination of four elements: Attribution (BY), NonCommercial (NC), ShareAlike (SA), and NoDerivs (ND). The six licences are BY, BY-NC, BY-SA, BY-ND, BY-NC-SA, and BY-NC-ND. The licences allow users to exercise one of the exclusive rights of the author protected by copyright without having to ask the licensor, as long as they comply with the terms and conditions; those terms correspond to the elements that make up the chosen licence. Let’s use the one I use (BY-NC-SA) to explain this. If you want to re-publish my blog, you can do it as long as you comply with the three main elements: you have to attribute me, you cannot do it for commercial purposes, and you have to share alike, that is, any derivatives have to be released under the same licence, namely BY-NC-SA in this instance. This last requirement is known as a copyleft clause. The argument against using CC for training would go something like this: if you want to use my blog to train an AI, you have to comply with those same requirements, therefore you have to attribute me, you can’t do it commercially, and you have to release your model under the same licence.
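For readers who think better in code, the way the six licences are assembled from the four elements can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not legal logic. The function names are made up for the example, and the rule that ShareAlike and NoDerivs never appear together is an assumption added here to explain why there are six licences rather than eight.

```python
from itertools import combinations

# The four licence elements named in the text.
ELEMENTS = {
    "BY": "Attribution",
    "NC": "NonCommercial",
    "SA": "ShareAlike",
    "ND": "NoDerivs",
}


def all_licences():
    """Build every licence code: BY plus any valid mix of NC, SA, ND."""
    optional = ["NC", "SA", "ND"]
    licences = []
    for size in range(len(optional) + 1):
        for combo in combinations(optional, size):
            # Assumption for this sketch: SA and ND are never combined.
            if "SA" in combo and "ND" in combo:
                continue
            licences.append("-".join(("BY",) + combo))
    return licences


def conditions(licence):
    """Spell out the elements behind a code such as 'BY-NC-SA'."""
    return [ELEMENTS[code] for code in licence.split("-")]


def allows_commercial_use(licence):
    """A licence allows commercial use when it lacks the NC element."""
    return "NC" not in licence.split("-")


if __name__ == "__main__":
    for licence in all_licences():
        print(licence, conditions(licence), allows_commercial_use(licence))
```

Filtering the result of `all_licences()` with `allows_commercial_use` leaves BY, BY-SA, and BY-ND, the licences without the NC element.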

Easy. Now let’s find out whether my blog is in the training data of a model, and I can sue them for copyright infringement for breaching the terms of the licence, right?

Not so fast. The first stumbling block is that not all uses fall under the licence restrictions. Roughly speaking, there are two types of uses contained in CC licences: the sharing of a work, and the adaptation of the work. Sharing is defined as “any act that shares the work to the public by means of reproduction, public display, public performance, distribution, dissemination, communication, or importation”. An adaptation is defined as a work protected by copyright that is “derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor.” The NonCommercial element restricts sharing and adapting the work for commercial purposes, and the copyleft clause only kicks in for adaptations, so if you adapt my blog into a play, then you have to release the play under BY-NC-SA.

So far so good, I can still sue those who train an AI without my permission, right? Again, not necessarily. The second stumbling block is in the very nature of training. Let’s say a company uses my blog in their training data. They make a copy, that is a reproduction, but they are not sharing it; they are just extracting statistical data from my words, so the sharing part of the requirements is not met. Moreover, extracting information from a work is not an exclusive right of the author, so training a model with a legal copy doesn’t infringe copyright in itself. And finally, to train a model is not to create a derivative of a work. This is starting to come up in some of the ongoing cases. To make a long story short, it has been argued that all outputs generated by an AI are derivatives of all of the inputs, but this argument holds no water: when you ask Bard or Claude to write a poem, those words aren’t a derivative of this blog, or of any other input used in training. So the copyleft requirement is not triggered, because it only applies if there is a derivative or adaptation of the work.

Finally, and perhaps most importantly, all Creative Commons licences clearly state that none of the requirements contained in the licence apply where a use is covered by existing exceptions and limitations to copyright, such as fair use or fair dealing. In the US the courts are currently handling the question of whether making a copy to train a model falls under fair use, so this issue is still open. In Europe and other countries there are already exceptions for text and data mining which could fall under this blanket approval of reuse. Furthermore, in many countries creating a private copy of a work is also fair dealing, which could apply to the making of some private copies for training. As mentioned, extracting information from a work does not infringe copyright.

Note also that not all of the above applies to every CC licence. Attribution is needed for derivatives or shared works, but in training the works aren’t being shared. Some CC licences, such as BY, BY-ND, and BY-SA, actually allow commercial uses of the work, in which case making copies does not require permission from the rights-holder. The SA and ND requirements are also met, because the outputs are not derivatives of the inputs; they don’t contain copies.

Concluding

Creative Commons has been used for around 20 years, and the number of lawsuits involving works released under these licences has been minimal. People choose to share their works under CC licences for various reasons: some are selfish, some altruistic, and some pragmatic. Personally, I have always enjoyed sharing. Since I don’t anticipate earning money from my writing, I prefer making my works freely available with minimal restrictions. CC licences facilitate this by signalling to others that they can share my work. However, this philosophy might not resonate with everyone. For those who do not share this view, CC may not be the ideal choice. If you prefer not to have your works widely shared, avoiding open licences and using technical tools and opt-outs might be a better approach. Respecting individual preferences will be crucial moving forward. I believe we are approaching a landscape similar to what we have seen with open access and open content, where such considerations are increasingly significant.

