Stock photographer sues AI association LAION: The crux with AI training data

A stock photographer is suing LAION, while the non-profit AI association argues that it has complied with applicable law. The dispute raises questions that reach beyond the case.


(Image: nep0/Shutterstock.com)

By
  • Silke Hahn

(This article is also available in German.)

A stock photographer has sued LAION, a Hamburg-based non-profit that promotes open-source AI and open-source datasets on which machine learning models such as Stable Diffusion have been trained. According to the account on his blog, the plaintiff had asked the association to "remove his images from the training data for the large AI systems." On April 27, 2023, he filed a lawsuit with the Hamburg District Court seeking an injunction against the alleged copyright infringement, according to his Twitter account. The photographer, Robert Kneschke, had contacted the press in advance, including the IT news magazine Heise.

Kneschke is going to court to obtain legal clarity on behalf of the entire industry, as he explained on the phone, and to work beyond his individual case toward compensation for the creators of the images used to train large machine learning models. Heise also spoke with representatives of the defendant association and obtained a legal assessment from an independent copyright lawyer. The ruling could be a landmark, and lawyers are therefore watching the case with interest: it is the first time that a creator of photos is seeking a court ruling on remuneration for the use of images in ML training. However, unlike ongoing cases against OpenAI or Stability AI, for example, this lawsuit targets a non-profit scientific association. English-language media and social media have taken up the case and discussed it heatedly, sometimes in a one-sided manner. What follows is a factual overview of the concerns of both sides.

LAION e.V. (Large-scale Artificial Intelligence Open Network) stands for the democratization of artificial intelligence and, according to its founders, primarily aims to make the new technology accessible to research. Among other things, it supports the replication of large proprietary AI models, calls for an international, publicly owned supercomputing facility, and warned against overregulation by the AI Act in an open letter to the EU Parliament in late April. According to the association, LAION's datasets do not contain pixel data but rather plain text data, metadata and URLs, which LAION-400M and LAION-5B use to link to images available elsewhere on the Internet. According to the founders, the datasets are index directories (catalogs) for finding image material on the open Internet. Links to unwanted images can be removed from the catalog. The images themselves, on the other hand, cannot be removed because the association does not provide images in its datasets.
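To illustrate the distinction between an index and an image archive, here is a minimal sketch in Python. It assumes a catalog row holds only a URL and a caption; the `IndexEntry` type, its field names, the `remove_links` helper and the example hosts are hypothetical and do not reflect the actual LAION schema or tooling.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class IndexEntry:
    """One hypothetical catalog row: a link plus text metadata, no pixel data."""
    url: str
    caption: str


def remove_links(entries, blocked_hosts):
    """Return a new catalog without entries whose URL host is in blocked_hosts."""
    blocked = {host.lower() for host in blocked_hosts}
    return [
        entry
        for entry in entries
        if (urlparse(entry.url).hostname or "").lower() not in blocked
    ]


# Made-up example rows; the hosts are placeholders.
catalog = [
    IndexEntry("https://images.example.com/a.jpg", "a red bicycle"),
    IndexEntry("https://stock.example.org/b.jpg", "a smiling chef"),
]

# Dropping a link changes only the catalog. The image itself stays wherever
# it is hosted, because the catalog never contained it.
cleaned = remove_links(catalog, {"stock.example.org"})
```

In this sketch, removing an entry deletes a reference, not a picture, which mirrors the association's description of its datasets.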

The German Act on Copyright and Related Rights (Urheberrechtsgesetz, UrhG; an English version is available online) contains two exceptions for text and data mining, which LAION invoked when creating the large image datasets. Section 44b provides a general exception, as long as image data is only used for pattern recognition or analysis and the image-text pairs are not stored after evaluation. The more specifically tailored Section 60d provides far-reaching exceptions for research purposes if the results are not intended for commercial use or any revenue generated is returned to research.

from the German Act on Copyright and Related Rights (Urheberrechtsgesetz – UrhG)

(Image: Gesetze im Internet (English))

The photographer questions the non-profit status and the research purpose of the association. On his blog, he alleges personal or economic ties to the company Stability AI, which he tries to substantiate mainly through the timeline. Stability AI had supported the association financially with a donation (according to the association and Stability AI, "one-time in a small grant") and had provided computing power. According to notarized excerpts from the register of associations, LAION was officially registered as an association in February 2022, whereas the datasets LAION-400M and LAION-5B were already being created during 2021. The photographer therefore argues that the association could not claim the copyright exceptions for the period before its registration. Kneschke doubts that the association already existed before February 2022.

Yet the origins of LAION, as well as its research purposes, can be readily traced back to the spring of 2021 from mostly public sources such as GitHub and Discord. The founding meeting of the association convened on July 7, 2021. Entry in the register of associations did not succeed immediately for formal reasons, Christoph Schuhmann, one of the founding members, explained when asked. At that time, LAION already existed as an unregistered association, and the Hamburg tax office had confirmed its non-profit status.

According to Schuhmann, the photographer had sent a cease-and-desist letter to the association via his lawyer and demanded that it "remove his pictures from the data set". He had threatened to allege intentional copyright infringement in thousands of cases and announced that he would assert claims for each individual case. As a result of the written warning, the association had to hire a lawyer, too. Technically, a URL reference catalog does not require stockpiling images. In essence, the dispute is about whether processing the images to create the index was lawful.

According to the Berlin copyright lawyer Dr. Till Jaeger, the case is unspectacular in this respect: first and foremost, it concerns the reimbursement of costs that is customary for a written cease-and-desist warning (not "damages", as stated in the photographer's blog entry). In Germany, anyone who receives such a warning can defend themselves against it. As long as LAION does not store image data in its datasets, it operates within the copyright limitation in Section 44b of the German Copyright Act (UrhG), which provides an exception for text and data mining. Under this provision, the evaluation and analysis of freely available data for pattern recognition is permitted in Germany, on the condition that no copy of the images is kept after the training or analysis.

According to the association, the LAION-400M and LAION-5B datasets do not contain pixel data, only text data, text embeddings, and URL references to image-text pairs available on the open Internet. The datasets are thus catalogs of index references to 400 million or, in the case of LAION-5B, five billion freely accessible images. The third paragraph of Section 44b adds a restriction: use is only permissible if the rights holder has not reserved it. However, such a reservation of use is only effective if it is made "in machine-readable form".

According to the wording of the law, anyone who wants to prevent their own images from being used for machine training would therefore have to have put a machine-readable opt-out in place in advance, for example in a robots.txt file. Evaluating images through automated analysis of individual or multiple digital and digitized works, for example to obtain information about patterns, trends and correlations, is considered permissible under German copyright law. In this context, the reproduction of lawfully accessible works for text and data mining is also permitted. "The reproductions must be deleted when they are no longer required for text and data mining," states the second paragraph of Section 44b of the German Copyright Act.
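As an illustration of what checking such a machine-readable opt-out could look like, the following sketch uses Python's standard `urllib.robotparser` module. The robots.txt content, the crawler name `ExampleTrainingBot` and the URLs are made up for this example, and whether a robots.txt entry satisfies the legal requirements of a reservation of use is a question for the courts, not something this code can answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site operator might publish in advance.
# "ExampleTrainingBot" is an invented crawler name, not a real one.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /photos/

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The opted-out path is blocked for the named crawler only.
blocked = may_crawl(ROBOTS_TXT, "ExampleTrainingBot", "https://example.com/photos/1.jpg")
elsewhere = may_crawl(ROBOTS_TXT, "ExampleTrainingBot", "https://example.com/about.html")
other_bot = may_crawl(ROBOTS_TXT, "OtherBot", "https://example.com/photos/1.jpg")
```

A dataset builder that honored such rules would skip any URL for which the check returns False. The sketch also shows the limits of the approach: the rule lives on the host that serves the file, so it does not travel with an image once copies are published elsewhere.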

One question that will have to be clarified in court is whether such a machine-readable reservation of use was in place in advance for the images that Kneschke is complaining about, and whether it was taken into account when the dataset was compiled. The association also invokes Section 60d of the German Copyright Act, which covers text and data mining for purposes of scientific research. Under that section, it is crucial that research organizations pursue non-commercial purposes, reinvest all profits in scientific research (if profits accrue), or operate in the public interest as part of a government-recognized mission. Research organizations must not collaborate with private companies that exert a determining influence on the research or have preferential access to its results.

Scientific activity is documented in public contributions from the LAION community from 2021 onward. For example, the Crawling@Home project (in which LAION was a major contributor) had established the core software for compiling LAION-400M in commits between June and August 2021, and LAION-400M was completed in August 2021. At that time, LAION members wrote the LAION-400M paper, which they uploaded to arXiv in November 2021 and published as part of Andrew Ng's Data-Centric AI Workshop at the prestigious NeurIPS conference. In September 2021, a paper was published that critically examines the contents of LAION-400M. Association co-founder Christoph Schuhmann first came into contact with Emad Mostaque (the CEO of Stability AI) in December 2021, according to screenshots of chat histories. At that time, the association and the first dataset already existed, and work on LAION-5B was well advanced, according to Schuhmann. The LAION team has since received an award that is recognized in the research community (the Outstanding Paper Award at NeurIPS 2022).

The German provision in Section 44b of the German Copyright Act is based on European law, specifically, according to the Berlin-based copyright lawyer Dr. Till Jaeger, on the DSM Directive on copyright in the Digital Single Market. Since machine learning uses automated analysis for pattern recognition, it is likely to be a typical use case. This means that copyright-protected training data such as images from the Internet can be used for machine learning without a license, as the specialist copyright lawyer writes in a Heise article. In the U.S., the legal situation would be different: there, the question would be whether the use of images falls under "fair use" and would thus be permitted without a license. That question also applies to mere training.

In ongoing court cases against providers of trained models, the lawyer believes the essential question is likely to be whether specific images can be reproduced (which goes beyond identifying image material used for training). According to the association's scientific director, Dr. Jenia Jitsev of the Jülich Research Center, images are obtained by various research groups around the world that conduct their research locally on their own machines and do not make these images available to anyone else. So far, he said, LAION e.V. only provides index-like datasets with references to the Internet, which third parties use to create models.

The court battle is part of a larger context: the artificial neural networks underlying the image generators are pre-trained on billions of image-text pairs from the Internet. With the new technical possibilities, the creative profession is confronted with an existential challenge to its income base, and representatives of various interest groups are wrestling over questions of copyright and ancillary copyright. The stock photographer and blogger Robert Kneschke lives off the income from the marketing and exploitation of his images. Ten years ago, the portfolio of the former amateur photographer already comprised more than 13,000 images, most of which he offers on stock image sites. In an advice article by t3n in 2014, Kneschke explained how photographers can generate over 10,000 euros in revenue per month. In a guest article for Heise in 2020, he explained the business model and reported on how the market has changed over the past 20 years.

Kneschke uses AI himself: for example, his own portfolio now includes around 3,000 AI-generated stock images, and he hosts commercial courses on working with AI generators. He does not see himself as an opponent of AI image generation, as he emphasized to Heise. He is concerned with the principle and legal clarification of how images can be exploited, he said. Furthermore, he has dealt with the generation of images by AI both theoretically and practically, and explains in his blog the generation of images in models such as Stable Diffusion. Robert Kneschke is aware that AI models are not based on image databases. In a blog entry, he explains how Stable Diffusion works: "The AI does not simply copy parts of existing images, but the information comes from the so-called latent space."

The remuneration of artists and authors for derivative works is currently still unregulated and will "probably soon have to be clarified by the courts," according to Kneschke. In doing so, the courts will face a fundamental problem: the models interpolate works from latent space that are no longer based on a clearly identifiable individual work, but copyright law presupposes clearly identifiable works.

According to the photographer, his goal is not simply "not to be in the data set." He sees his lawsuit as a contribution to clarifying whether the laws are being observed. The goal, he says, is appropriate remuneration for creators. As a stock photographer, he and his colleagues make a living from the fact that images, including text descriptions, can be found at agencies, so withdrawing from the Internet is out of the question. One problem is that a stock photographer has no control over what customers do with the images after buying them: copies of the images show up in public. A machine-readable opt-out is easily lost along the way, Kneschke said. In addition, the effort required of those affected would be immense if each image had to be taken down individually by hand. The millions of well-described stock images are valuable for machine learning, he added.

On the one hand, there is the concern of image creators and creatives who make a living from producing and licensing their work, on whose publicly viewable works AI models have been trained without their consent, and whose incomes are shrinking as a result of automatically generated AI output. On the other hand, there are tensions between proprietary and open-source approaches, and tendencies toward monopoly. It is in the public interest that people have a choice of different generative AI offerings and providers in the future, and that a balance of interests is struck.

What happens next in the Kneschke v. LAION case remains open. Copyright law is under pressure in the age of generative AI systems, and new cases may help clarify the situation.

[Disclaimer: The law firm of Heise lawyer Joerg Heidrich is representing LAION in the proceedings. He was not involved in the writing of the article].

(sih)