The Rise Of Generative AI And The Growing Problem With Data

Posted by Peter Rudin on 20. September 2024 in Essay

Generative AI (Credit: NTT Data)

Introduction

AI-generated text and imagery are flooding the web, a trend that could become a huge problem for generative AI models. According to physicist Aatish Bhatia, writing for the New York Times, a growing body of research shows that training generative AI models on AI-generated content causes the models to degrade, an effect similar to inbreeding. Last year the AI researcher Jathan Sadowski from Monash University in Melbourne, Australia, dubbed the phenomenon ‘Habsburg AI’, a reference to Europe’s famously inbred royal family. Hence, training on AI-generated datasets reduces the credibility of the recommendations delivered by the system and is likely to diminish the value of generative AI models.

What Is Generative AI?

Generative AI can be thought of as a machine-learning model that is trained to create new data, often in response to prompts. Generative AI models learn the patterns and structure of their training data and then generate new data with similar characteristics. The recent excitement around generative AI has been driven by the simplicity of new user interfaces that enable users to generate graphics and videos in a matter of seconds. Generative AI can be described as a subfield of AI that uses deep learning algorithms to create new content, such as audio, as well as synthetic datasets that imitate real-world data but are artificially generated. Deep learning is a specific type of machine learning that uses artificial neural networks with multiple layers to analyse and learn from data: algorithms learn from large amounts of data to identify patterns and make decisions. In contrast, generative AI refers to AI technologies that can create new content, ideas or data that are coherent and plausible, often resembling human-generated outputs. It has a vast range of practical applications in domains such as computer vision, natural language processing and music generation. At the heart of generative AI, so-called foundation models represent the brains behind the process. The success of their outputs depends on three essential components: diversity, quality and coherence. In other words, generative AI models strive to produce outputs that are varied, high-quality and logically connected. There are different types of generative AI models. They are not confined to isolated categories but can be combined for more advanced applications. Each of these models creates new content based on the data it has been provided with.

Recent advancements include the ability of generative AI models to understand context more deeply, leading to more coherent outputs. Additionally, efforts are under way to fine-tune models for specific industries, such as healthcare and finance, showcasing their adaptability in addressing domain-specific challenges.

The Problem With Limited Data

Because the availability of real-world data is limited, generative models are increasingly being trained on synthetic data. As a result, synthetic datasets of all kinds are rapidly proliferating. Publicly available generative models have not only revolutionized the image, audio and text domains, but they are also starting to impact the creation of videos, 3D models, graphs, software and websites. Companies like Google and Microsoft are incorporating generative models into their consumer services, often with no indication that the data is synthesized. Moreover, AI-synthesized data is increasingly used by a wide range of applications for several reasons. First, it can be much easier, faster and cheaper to synthesize training data than to source real-world samples, particularly for data-scarce applications. Second, in some situations synthetic data augmentation has been empirically found to boost AI system performance. Third, synthetic data can protect privacy in sensitive applications like medical imaging or medical record aggregation. Fourth, and most importantly, as the demand for data required by deep learning models continues to grow rapidly, developers are simply running out of real data on which to train them. As a result, not only have system developers begun training AI systems on synthetic data, but the human annotators who provide gold-standard annotations for supervised learning tasks are increasingly using generative models as well, thereby improving their own productivity and at the same time reducing the cost of generative AI models.

Potential Of A Generative Model Collapse

With data at such a premium, there are indications that developers have to work harder to source high-quality data. For example, the documentation accompanying the GPT-4 release credited an unprecedented number of staff involved in the data-related parts of the project. We may also be running out of real data produced by humans. Some estimates predict that the pool of available human-generated data might be exhausted by 2026 which, according to some experts, could lead to a catastrophic collapse of generative AI models. To prevent this from happening, OpenAI and others are racing to shore up exclusive partnerships with content industry leaders such as Shutterstock, Associated Press and News Corp. They own large proprietary collections of human-generated data that are not readily available on the public internet. At the same time, it is becoming impossible to reliably distinguish between human-generated and AI-generated content. Nevertheless, the prospects of a catastrophic model collapse might be overstated. Most research so far looks at cases where synthetic data replaces human data. In practice, however, human and AI-generated data can complement each other, which reduces the likelihood of a collapse. The most likely future scenario is an ecosystem of diverse generative AI platforms rather than one monolithic model. This also increases robustness against a collapse. Hence, a flood of synthetic content might not pose an existential threat to the progress of AI development, but it does affect the digital public good that the internet represents. For instance, researchers found that activity on the coding website Stack Overflow dropped by 16 percent after the release of ChatGPT. In addition, as AI-generated content becomes systematically homogeneous, we risk losing socio-cultural diversity, and some groups of people could even experience cultural erasure. One method to remedy this would be watermarking AI-generated content.
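The difference between replacing and complementing human data can be illustrated with a deliberately simple toy model. The sketch below is not taken from the research mentioned above: it assumes the "model" is nothing more than a normal distribution fitted to a small sample, and the function name simulate and its parameters are hypothetical. Each generation fits the distribution to data drawn partly from the previous generation's model and partly from the fixed "real" distribution.

```python
import random
import statistics


def simulate(generations, sample_size, real_fraction, seed=0):
    """Toy model-collapse loop (illustrative assumption, not a real training run).

    Returns the standard deviation fitted in each generation. The "real"
    data source is a standard normal distribution (mean 0, std 1).
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real distribution
    n_real = round(sample_size * real_fraction)
    history = []
    for _ in range(generations):
        # Fresh human ("real") data, if any is mixed in.
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        # Synthetic data sampled from the previous generation's model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size - n_real)]
        data = real + synthetic
        # "Training" here is just re-estimating mean and spread.
        mu, sigma = statistics.fmean(data), statistics.stdev(data)
        history.append(sigma)
    return history


if __name__ == "__main__":
    only_synthetic = simulate(1000, 20, real_fraction=0.0)
    half_real = simulate(1000, 20, real_fraction=0.5)
    print("final spread, synthetic only:", only_synthetic[-1])
    print("final spread, half real data:", half_real[-1])
```

In this toy setting, the spread of the purely synthetic chain shrinks towards zero over many generations, which mirrors the loss of diversity described above, while the chain that keeps receiving real data stays anchored. It says nothing about how quickly, or whether, large generative models would behave the same way.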

Today’s Generative AI Applications

Despite the potential problems associated with a possible generative AI model collapse, a number of applications have been developed that defy such a scenario. What all the different approaches of generative AI have in common is that they convert inputs into a set of tokens, which are numerical representations of chunks of data. As long as one’s data can be converted into this standard format, then in theory one could apply these methods to generate new synthetic data, which opens up an array of generative AI applications. According to an article published by MIT News late last year, Phillip Isola, an associate professor of electrical engineering and computer science at MIT and member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), is heading up a group of researchers who are using generative AI to create synthetic image data that can be used to train other intelligent systems, for example by teaching a computer vision model how to recognize objects. Another group headed by Tommi Jaakkola, Thomas Siebel Professor of Electrical Engineering and Computer Science at MIT, who is also a member of CSAIL and the Institute for Data, Systems, and Society (IDSS), is using generative AI to design novel protein structures or valid crystal structures that specify new materials. In the same way that a generative model learns the dependencies of language, a model that is shown crystal structures instead can learn the relationships that make structures stable and therefore realizable. However, while generative models can achieve outstanding results, they are not always the best choice for solving a problem. For tasks that involve making predictions on well-structured data, such as the tabular data in a spreadsheet, generative AI models tend to be outperformed by traditional machine-learning methods, according to experiments conducted by another scientist working at MIT.
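To make the notion of tokens more concrete, here is a minimal sketch of turning text into numbers and back. It assumes a toy scheme in which every whitespace-separated word gets its own integer ID; production systems typically use subword tokenizers instead, and the names build_vocab, encode and decode are hypothetical helpers chosen for this illustration.

```python
UNKNOWN = "<unk>"  # placeholder for words that were never seen


def build_vocab(texts):
    """Assign an integer ID to every distinct word, in order of first appearance."""
    vocab = {UNKNOWN: 0}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert a text into a list of token IDs; unseen words map to ID 0."""
    return [vocab.get(word, vocab[UNKNOWN]) for word in text.lower().split()]


def decode(token_ids, vocab):
    """Convert token IDs back into a space-separated string."""
    inverse = {token_id: word for word, token_id in vocab.items()}
    return " ".join(inverse[token_id] for token_id in token_ids)


if __name__ == "__main__":
    vocab = build_vocab(["the cat sat", "the dog sat"])
    ids = encode("the dog sat", vocab)
    print(ids)                 # [1, 4, 3]
    print(decode(ids, vocab))  # the dog sat
```

The same idea carries over to other kinds of data: once images, protein sequences or crystal structures are mapped to sequences of numbers in a consistent way, a generative model can in principle be trained on them.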

Conclusion

We urgently need cross-disciplinary research on the social and cultural challenges posed by AI systems and their application of generative AI. Human interactions and human datasets are important to reduce the risk of a future generative model collapse caused by dependency on synthetic data. If managed correctly, generative AI provides a powerful tool that goes beyond the traditional data handling methods of deep learning, the type of machine learning based on neural networks. Above all, generative AI offers a powerful interface for solving problems beyond the capabilities of today’s traditional AI systems.
