Open-source AI: LAION proposes to openly replicate GPT-4 – a public call

Truly open AI: LAION calls for a supercomputer to develop open-source AI, replicate large models like GPT-4 and explore them together as a research community.

By Silke Hahn

(This article is also available in German.)

The pause in the development of large-scale AI models demanded in an open letter is stirring tempers and provoking opposition, for example from open-source advocates. The non-profit Large-Scale Artificial Intelligence Network (LAION, a registered association) has published a counterproposal. Instead of a pause, LAION calls for accelerating research and establishing a joint international computing cluster for large open-source artificial intelligence models. CERN in Switzerland serves as the model.

Prominent members of the AI research community support the call, such as theoretical physicist Surya Ganguli (Stanford), Jürgen Schmidhuber (head of the Swiss AI lab IDSIA), Irina Rish (ML professor in Montréal), Darmstadt AI professor Kristian Kersting (co-director of hessian.AI), Thomas Wolf (co-founder of Hugging Face), Konrad Koerding (University of Pennsylvania) and Joscha Bach (cognitive scientist and AI researcher at Intel). David Ha, Head of Strategy at Stability AI (known in the AI scene under the pseudonym "hardmaru"), and Robin Rombach (lead author of Stable Diffusion) are also among the supporters.

The association for open-source AI, based in Germany, promotes open AI models, transparent research and freely accessible datasets. The non-profit organization researches large AI foundation models and provides datasets, tools and models. Today's large vision-language and text-to-image models, such as Stable Diffusion and Google's Imagen, are trained primarily on LAION datasets. LAION is backed by a community of about 20,000 people worldwide who conduct research in machine learning and artificial intelligence.

LAION's goals are critical AI safety research that benefits the public, and technological independence from the commercial AI models of large corporations that disclose no technical details about their products while extracting user data to further train their models (black boxes like ChatGPT and GPT-4). The call for cooperation is directed at the European Union, the USA, Great Britain, Canada and Australia.

LAION proposes to democratize AI research and build a publicly funded supercomputer with 100,000 powerful accelerators (such as graphics processing units) for training foundation models, in order to create open-source replicas of models as large and powerful as GPT-4 as quickly as possible. The association has extensive experience with supercomputers and large language-vision models. One of the founding members and scientific director of LAION, Dr. Jenia Jitsev, works as a research group leader at the Helmholtz Association's high-performance computing centre in Jülich (Jülich Supercomputing Centre, JSC for short). This is where JUWELS, Germany's largest scientific supercomputer and one of the largest high-performance computing clusters in Europe, is located. JUWELS is equipped with around 4,000 NVIDIA A100 GPUs and is used, among other things, for quantum computations.

LAION has also researched the scaling laws of an important class of image-text models: openCLIP, an open alternative to the deep learning model CLIP (Contrastive Language-Image Pre-training) introduced by OpenAI, was trained at the Jülich computing centre. The association published the previously unavailable models as open source in cooperation with the openCLIP community. Until now, OpenAI's CLIP has formed the training basis for generative and numerous other models with word-image recognition; the first version of Stable Diffusion, for example, was trained with CLIP in addition to the labelled image datasets from LAION.
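
The published openCLIP models can be tried out in a few lines of code. Below is a minimal sketch of zero-shot image-text matching, assuming the open_clip_torch package is installed; the model name, pretrained tag and image path are example values, not prescriptions:

```python
# Minimal sketch: zero-shot image-text matching with an openCLIP checkpoint
# trained on LAION data. Assumes `pip install open_clip_torch pillow`.
import torch
import open_clip
from PIL import Image

# Example architecture and pretrained tag from the open_clip model zoo
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # any local image
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    # Cosine similarity between the image and each caption
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(probs)  # e.g. tensor([[0.98, 0.02]]) for a cat photo
```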

Superpowers with public funding: Supercomputer JUWELS

JUWELS – a multi-petaflop modular supercomputer operated by Jülich Supercomputing Centre.

(Image: Forschungszentrum Jülich)

Dr. Jenia Jitsev, one of the co-founders and scientific directors of LAION, heads his own research group at the Jülich Supercomputing Centre of the Helmholtz Association. The centre hosts JUWELS, one of the largest supercomputers in Europe, equipped with over 4,000 NVIDIA A100 GPUs. Together with scientists involved in LAION, such as Ludwig Schmidt from the University of Washington and Irina Rish from the University of Montreal, as well as experienced AI software engineers such as Ross Wightman, the author of the PyTorch Image Models library (timm), Jitsev has applied for computing time on government-funded supercomputers such as JUWELS or Summit at Oak Ridge National Laboratory in the USA.

Only machines of this size make AI research possible at the scales where it becomes exciting and where the large industrial labs of Google, Meta and Co. operate, according to Jitsev. Even these supercomputers, however, are still small compared to what is available in such industrial labs. It recently became known that OpenAI's partner Microsoft is buying thousands of GPUs to build new supercomputing infrastructure. According to LAION's scientific director, it is therefore absolutely necessary to build supercomputers of sufficient size with public funds. This is the only way to ensure independent and transparent research in a field that is of enormous importance to society.

In Jülich, research projects can apply for access to the computing facilities. The website contains information on scientific clouds, quantum computing and an FAQ.

In 2022, LAION received the NeurIPS Outstanding Paper Award for its work on the LAION-5B dataset and its validation through openCLIP models. openCLIP represents a breakthrough for the democratization of large language-vision models, the jury said. The Conference on Neural Information Processing Systems (NeurIPS) has existed since 1987. It is considered one of the most important conferences for artificial intelligence worldwide, and an award by an independent NeurIPS jury carries weight in the research community – more on this can be found in a blog entry by the organizers.

An open dataset allows for far-reaching control and governance, and openCLIP reduces dependence on the preferences or negligence of individual commercial providers. CLIP is known to reproduce biases, which has caused many problems, because the bias of the dataset feeds into the training and thus into the model. With openCLIP, researchers have more options to control the training of their models themselves.
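
Open checkpoints make such effects measurable. As a rough illustration (not from the article): because openCLIP ships checkpoints trained on different open datasets, one can compare how the training data shifts a model's text associations. The checkpoint tags below come from the open_clip model zoo; the probe labels are purely illustrative:

```python
# Sketch: compare text embeddings from two openCLIP checkpoints trained on
# different open datasets (LAION-400M vs. LAION-2B) to probe dataset effects.
import torch
import open_clip

def text_embeddings(pretrained_tag, labels, arch="ViT-B-32"):
    model, _, _ = open_clip.create_model_and_transforms(arch, pretrained=pretrained_tag)
    tokenizer = open_clip.get_tokenizer(arch)
    with torch.no_grad():
        feats = model.encode_text(tokenizer(labels))
    return feats / feats.norm(dim=-1, keepdim=True)

labels = ["a photo of a doctor", "a photo of a nurse"]  # illustrative probe
for tag in ("laion400m_e32", "laion2b_s34b_b79k"):
    emb = text_embeddings(tag, labels)
    # How similar the two concepts are under each checkpoint's embedding space
    print(tag, float(emb[0] @ emb[1]))
```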

LAION advocates public infrastructure for training accessible state-of-the-art large AI models, so that independent experts and society have unfiltered access to the base technology and open alternatives to powerful models like GPT-4 can emerge. It would be risky for humanity if the world's population depended, via a single API, on opaque commercial offerings from a few corporations with monopoly positions, Christoph Schuhmann told heise Developer. Like Jenia Jitsev, the Hamburg computer science teacher is one of the seven founding members of the association.

The problem is that providers can change the nature and behaviour of the models at will, without the knowledge or input of customers. Large corporations like Microsoft-OpenAI withhold technical information for competitive reasons. The technical report on GPT-4, for example, provides no relevant information on model size, training, training data or architecture, so independent security researchers cannot work with it.

A recent blog entry by OpenAI on AI safety has disappointed private customers and researchers alike. The text scratches the surface, again gives no technical details and seems marketing-driven. The message is probably a half-hearted reaction to the investigations against OpenAI in the USA, Canada and Italy, where national authorities are currently investigating, partly over data protection concerns and security risks, partly over suspected competition violations in the market launch of GPT-4 and ChatGPT.

In the blog entry "Our Approach to AI Safety", OpenAI does not mention existential or already known risks. It also remains unclear what concrete measures the company is taking to protect its users, and what exactly the six months of safety work on GPT-4 consisted of. Given the already tangible risks posed by AI, such as disinformation, algorithmic clustering and the far-reaching, non-transparent processing of user data, there is an urgent need for alternatives, according to LAION. American data and platform corporations are siphoning off masses of public data for the new base technology and processing it for profit.

According to its website, OpenAI stores all user data for 30 days on American servers and reserves the right to analyse it itself or have it analysed by undisclosed third parties. At the latest when GPT-4 and ChatGPT are fully integrated into the widely used Microsoft Office suite, billions of people will be using the system in one fell swoop, Schuhmann points out. The public has a right to non-profit progress, accessibility and participation – and to information instead of marketing. At present, he says, research and the entire academic world depend on financially powerful tech companies and their hardware. Even NASA's biggest computer is only half as fast as that of Stability AI.

OpenAssistant: an open-source ChatGPT

(Image: Screenshot from Yannic Kilcher's OpenAssistant broadcast)

Together with YouTuber and AI influencer Yannic Kilcher, LAION is working on OpenAssistant, an open-source variant of ChatGPT. Volunteers drive the chatbot project: they write sample answers and rate the answers of others. Other participants then train chat models on this basis and publish them as open source. The first work-in-progress models are meanwhile available as unofficial demos; Yannic Kilcher presented them in his YouTube broadcast on 7 April 2023 ("OpenAssistant First Models are here: Open-Source ChatGPT").

LAION and Kilcher plan to publish the training data collected so far and the first official chat models "in a couple of weeks", they said when asked. The prototypes can be tested on Hugging Face, and further details can be found on the OpenAssistant project website.

Update: OpenAssistant ("Conversational AI for everyone") was officially released on 15 April 2023.
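
The released checkpoints can be queried locally with the Hugging Face transformers library. A minimal sketch, assuming the model ID and prompt tokens of the early oasst-sft-1-pythia-12b checkpoint as documented on its Hugging Face model card (a 12-billion-parameter model needs a correspondingly large GPU):

```python
# Minimal sketch: querying an early OpenAssistant checkpoint via transformers.
# Assumes `pip install transformers accelerate torch`; model ID and prompt
# format follow the oasst-sft-1-pythia-12b model card on Hugging Face.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OpenAssistant/oasst-sft-1-pythia-12b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "<|prompter|>What is LAION?<|endoftext|><|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```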

Technology philosopher Armin Grunwald also takes a critical view of the power and ownership structures: a disproportionately large share of research on digitalization and AI is concentrated in privately managed companies, especially in big-data corporations from the USA and China. This observation is also supported by the recently published AI Index Report 2023 from the Stanford Institute for Human-Centered AI (Stanford HAI), produced with the collaboration of Hugging Face and Jack Clark (Anthropic), which warns of global distortions due to the concentration of research and capital in the USA and China. Through their decisions, the visions of society and the future held by a few managers and multi-billionaires at monopoly corporations are shaping our future without public debate, voice or legitimacy.

The real loss of control we need to protect ourselves from is not the loss of control to algorithms mentioned in the open letter, says Grunwald; that would be a pointless worry, since algorithms and computer programmes have neither intentions nor instincts for power. "Their makers, however, have these in abundance," warns the technology philosopher. The problem with current AI development is not an impending loss of power to algorithms, but the non-transparent concentration of power over future society in the hands of a few. Grunwald points out: "Of course, foresighted considerations, impact research and ethics are needed for the future development of AI. But these will remain toothless if nothing changes in the aforementioned power constellation."

Christoph Schuhmann believes it is possible to replicate GPT-4. From the combined knowledge about similar models like LLaMA and Flamingo, he says, enough information can be derived to tackle it. LLaMA, for example, was trained with 1.5 trillion tokens; for GPT-4, a tenfold amount can be assumed, and the training probably also included images (multimodal). The context length was increased, and presumably a "mixture of experts" architecture was used, for example a differently weighted training of the model's various subnetworks. It is common to fine-tune about ten to twenty percent of the middle layers (subnetworks) for specific tasks and keep the rest of the model the same.

These are two different things, Dr. Jitsev points out: a mixture-of-experts (MoE) architecture, as was probably used in GPT-4, and finetuning techniques. Finetuning is often used to adapt a small portion of the overall network to a specific task, and there is work on how to recombine such finetuned networks. However, it is not clear whether this works well at scales like that of GPT-4, so its relevance to GPT-4 is unclear. That MoE training was used for GPT-4 is fairly certain; everything else is speculation, according to Jitsev.
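
For readers unfamiliar with the term: a mixture-of-experts layer replaces a single feed-forward block with several parallel "expert" subnetworks and a small gating network that routes each token to one (or a few) of them. The following is a purely illustrative top-1-routing sketch; GPT-4's actual architecture is not public and all dimensions here are arbitrary:

```python
# Illustrative mixture-of-experts (MoE) layer with top-1 routing: a gating
# network picks one expert feed-forward subnetwork per token. Not GPT-4's
# (unpublished) architecture; dimensions are arbitrary.
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)    # routing probabilities
        weight, idx = scores.max(dim=-1)         # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():                       # run only the chosen experts
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)       # 16 token embeddings
print(MoELayer()(tokens).shape)     # torch.Size([16, 512])
```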

Enough research papers explain this approach, in which only the parameters of the middle layers are exchanged and about 80 percent of the foundation model remains the same. The middle layers could be kept in RAM and swapped in via CPU offloading. The breakthrough would then be to truly scale up the training with high-quality data. Above all, now is the time: according to Anthropic, the lead of large, well-funded tech companies is likely to become unassailable from 2025/26. According to leaked business documents, Anthropic is apparently planning an AI model on the order of 10^25 floating-point operations (FLOPs) – the system is said to be ten times as powerful as today's most powerful models, and Anthropic intends to invest 5 billion US dollars in its development.
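
A back-of-the-envelope calculation shows what such a training budget would mean for the cluster LAION proposes; all numbers are rough assumptions for illustration:

```python
# Back-of-the-envelope: duration of a 1e25-FLOP training run on LAION's
# proposed 100,000-accelerator cluster. All figures are rough assumptions.
total_flops = 1e25        # training budget attributed to Anthropic's plan
n_gpus = 100_000          # accelerators in LAION's proposed cluster
flops_per_gpu = 312e12    # NVIDIA A100 peak BF16 throughput in FLOP/s
utilization = 0.4         # assumed fraction of peak reached in practice

seconds = total_flops / (n_gpus * flops_per_gpu * utilization)
print(f"{seconds / 86_400:.1f} days")  # about 9 days under these assumptions
```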

A detailed presentation of LAION's position, with pros and cons of large AI language model projects, can be found at Open Petition. More about the association and its background can be found in an interview with computer scientist Christoph Schuhmann, one of the seven founders of LAION. Over the next two months, the non-profit open-source AI association will be collecting signatures to demand an AI supercomputing cluster in public hands.

The German Ethics Council took a stand on current AI development in March. The approximately 300-page statement on the topic "Human and Machine – Challenges posed by Artificial Intelligence" is publicly available as a PDF document (in German, with an English summary).

(sih)