From Google-Search to GPT4
Credit: analyticsinside.net
Introduction
Language plays a central role in how we communicate, how we learn and teach, how we organize and take political action, and how we convey the emotions we experience in our lives. While Google-Search is used to profile users based on the topics that interest them, GPT4 can capture our emotions, as pure text prompts are enhanced with multimodal capabilities, including a camera-based visual question-answering mode of the kind demonstrated by Microsoft’s KOSMOS-1 software. Observing the user visually raises many issues about the application of this technology. To some, GPT4 is approaching Artificial General Intelligence (AGI), considered the ‘Holy Grail’ of AI. To others, capturing and profiling our emotions will result in the loss of one’s individuality and personality.
Some Theories about Emotions
Marvin Minsky, one of the founding fathers of AI, was once asked about his view on machine emotions. His response was:
“The question is not whether intelligent machines can have any emotions, but whether machines can be intelligent without any emotions.”
Indeed, without emotions we would not have survived as a species. While our intelligence has improved over time, emotions such as fear have triggered the survival mechanisms that evolution gave us. However, emotions can also arise from an internal thought process. Finding a solution to a complicated mathematical problem can trigger happiness: it may be a purely introspective action with no external cause, but solving the problem still triggers an emotion. Emotions like joy, sadness, surprise, disappointment, fear and anger can now be simulated with computational methods such as GPT4, which thereby capture our emotions as we interact and communicate with an intelligent machine. There are different theories to explain what emotions are and how they operate. The following is a summary of the two most popular:
Evolutionary Theories: According to Wikipedia, evolutionary accounts of emotion fall into two groups. The first claims that emotions are the result of natural selection that occurred in early hominids. The second agrees that emotions are adaptations but suggests that the selection occurred much earlier. Robert Plutchik, a well-known proponent of the evolutionary view, claims that there are eight basic emotions. Each one is an adaptation to external events, and all eight are found in living organisms. According to Plutchik, emotions are similar to traits like DNA: they are so important that they arose once and have been conserved ever since. In the case of emotions, which he calls ‘basic adaptations needed by all organisms in the struggle for individual survival’, Plutchik suggests that the selection occurred in the Cambrian era 600 million years ago. In his view the eight major factors that form the basis of our emotions are incorporation, rejection, destruction, protection, reproduction, reintegration, orientation and exploration.
Social Theories: According to these theories, emotions are acquired or learned by individuals through experience. Emotions typically occur in social settings and during interpersonal transactions rather than being an individual’s response to a particular stimulus. Emotions and their expression are regulated by social norms, values and expectations. These norms and values determine which emotional reactions are appropriate and which events should make a person angry, happy or jealous.
From Emotion to Perception
Emotion and perception are closely related. In some cases the trigger may be internal, for example a thought or the memory of a past personal experience. The early part of the emotion process covers the activity between the perception and the triggering of the emotion. The later part covers the bodily response, with changes in heart rate, blood pressure, facial expression and skin conductivity.
William James (1884) was the first to develop a somatic feedback theory, which has recently been revived and expanded by the neuroscientist Antonio Damasio, professor at the University of Southern California, and the philosopher Jesse Prinz, professor at the City University of New York. Somatic feedback theories suggest that once the bodily response has been generated, the mind registers these bodily activities. The mental state caused by the bodily response defines emotion. A consequence of this view is that without a bodily response there cannot be an emotion. Emotions provide the scaffolding for the construction of social cognition and are required for the processes which define consciousness.
According to Wikipedia, perception is the organization, identification and interpretation of sensory information in order to understand the information provided by the environment. Perception involves signals that travel through the nervous system and that in turn result from physical or chemical stimulation of the sensory system. Vision involves light striking the retina of the eye, smell is mediated by odor molecules and hearing involves the sensing of pressure waves. Further processing transforms this low-level sensory input into higher-level information, for example by extracting shapes for object recognition. Perception depends on complex functions of the nervous system, but subjectively seems mostly effortless because this activity occurs outside our conscious awareness. Although the senses were traditionally viewed as passive receptors, the study of illusions and ambiguous images has demonstrated that the brain’s perceptual system actively and pre-consciously attempts to make sense of this input. The perceptual system enables individuals to see the world around them as stable, even though the sensory information is typically incomplete and rapidly varying. Human brains are structured in a modular way: different areas of the brain process different kinds of sensory information. This data is recorded and mapped by the brain’s biological structure and used, for example, for decision-making.
GPT4: Aligning Perception with Large Language Models
The convergence of language, multimodal perception and world modelling is considered a prerequisite for achieving some form of advanced artificial intelligence. A paper published by a Microsoft Research unit in March 2023 introduces a model dubbed KOSMOS-1. It describes a Multimodal Large Language Model (MLLM) that can perceive general modalities such as vision, learn from context and follow instructions. The capability of perceiving multimodal input is critical to Large Language Models (LLMs) for the following reasons: First, multimodal perception enables LLMs to acquire common-sense knowledge beyond text descriptions. Second, aligning perception with LLMs opens the door to new tasks, such as guiding the behaviour of robots or analysing the information embedded in documents. Third, the capability of perception unifies various applications, because graphical user interfaces represent the most unified way to interact with a system.
The KOSMOS-1 transformer model is trained on web-scale multimodal corpora: large-scale text resources, high-quality images and arbitrarily interleaved image and text documents mined from the web. The research team evaluated various settings, including zero-shot, few-shot and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that KOSMOS-1 achieves impressive performance in language understanding and generation, including tasks fed directly with document images, in perception-language tasks such as multimodal dialogue and visual question answering, and in vision tasks such as image recognition.
The backbone of KOSMOS-1 is a transformer-based causal language model. It can perceive general modalities, follow instructions, learn from context and generate output. Modalities other than text are embedded and fed into the language model, while the transformer decoder serves as a general-purpose interface to multimodal input.
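The idea of embedding non-text modalities and feeding them into a causal decoder can be illustrated with a toy sketch. This is not the KOSMOS-1 implementation: the functions embed_text, embed_image, build_sequence and causal_mask, the boundary labels <s>, <image> and </image>, and the embedding width are all hypothetical choices made for illustration. The sketch only shows how text tokens and image vectors could be flattened into one sequence over which a decoder attends left to right.

```python
import random

EMBED_DIM = 4  # toy embedding width, an arbitrary illustrative choice


def embed_text(token, dim=EMBED_DIM):
    """Hypothetical text embedding: a fixed pseudo-random vector per token."""
    rng = random.Random(token)  # seeding with the token keeps it deterministic
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def embed_image(pixels, dim=EMBED_DIM, num_vectors=2):
    """Hypothetical image encoder: averages pixel chunks into a few vectors."""
    chunk = max(1, len(pixels) // num_vectors)
    vectors = []
    for i in range(num_vectors):
        part = pixels[i * chunk:(i + 1) * chunk] or [0.0]
        mean = sum(part) / len(part)
        vectors.append([mean] * dim)
    return vectors


def build_sequence(segments):
    """Flatten interleaved text and image segments into (label, vector) pairs."""
    sequence = [("<s>", embed_text("<s>"))]
    for kind, payload in segments:
        if kind == "text":
            for token in payload.split():
                sequence.append((token, embed_text(token)))
        elif kind == "image":
            # Assumed boundary markers that tell the decoder where an image starts and ends.
            sequence.append(("<image>", embed_text("<image>")))
            for j, vector in enumerate(embed_image(payload)):
                sequence.append((f"img_{j}", vector))
            sequence.append(("</image>", embed_text("</image>")))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence


def causal_mask(length):
    """mask[row][col] is True when position `row` may attend to position `col`."""
    return [[col <= row for col in range(length)] for row in range(length)]


segments = [
    ("text", "What is in this picture?"),
    ("image", [0.1, 0.2, 0.3, 0.4]),  # stand-in for real pixel data
    ("text", "Answer:"),
]
sequence = build_sequence(segments)
mask = causal_mask(len(sequence))
print([label for label, _ in sequence])
```

In this sketch the decoder never needs to know which positions came from an image: every position is simply a vector of the same width, and the causal mask ensures that each position only sees what precedes it. A real model would use learned embeddings and a trained vision encoder in place of the toy functions.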
As the range of available sensors is continuously expanded, it seems likely that transformer-based models such as GPT4, in combination with Adversarial Neural Networks (ANNs), signal a paradigm shift in the application of AI. Artificial General Intelligence (AGI), with all its potential benefits and risks, has moved one step closer to reality.
Conclusion
According to researchers from Stanford University, MLLMs epitomize a broad paradigm shift towards so-called foundation models, which are defined as machine learning models that can be adapted to an impressively wide range of tasks. Billions of individuals will be touched by the impact of this revolutionary technology. With all the excitement and fear surrounding language models, we need to know what this technology can and cannot do and what risks it poses. We must have a deeper scientific understanding and a more comprehensive account of its societal impact. Transparency is the vital first step towards these goals, because transparency begets trust and standards. By taking a step towards transparency, the researchers aim to transform foundation models from an immature emerging technology into a reliable infrastructure that embodies human values.