Can GPT-4 Perform Reasoning Similar to Humans?

Posted by Peter Rudin on 12. January 2024 in Essay

[Image] The Future of GPT-4 (Credit: OpenAI)


One of the unique features of GPT-4 is that it is a Large Multimodal Model (LMM). This new breed of AI systems is more adaptable, interactive, and capable of handling complex, multi-faceted tasks in dynamic environments. They learn to adapt, using sensors and actuators to achieve a user-defined purpose, and independently decide which steps to take to reach the best solution to a specific problem, for example automatically steering a car through dense traffic. Through learning they gradually improve their capacity for decision making. However, there is an ongoing debate about whether LMMs such as GPT-4 are capable of replicating human reasoning. Some researchers maintain that LMMs may develop abstract reasoning based on pattern recognition. Others argue that the internal workings that would facilitate such capabilities remain unexplained. Moreover, some experiments show that these models fail to provide new insights beyond the scope of their training data. The following discusses some of the current research efforts towards developing AI machines with human-like reasoning capabilities.

A Definition of Human Reasoning 

According to Wikipedia, human reasoning refers to the cognitive process of making sense of the world around us, drawing conclusions and solving problems based on available information and past experience. How humans develop reasoning abilities is an ongoing area of research in psychology. Reasoning involves the use of logic, critical thinking, and inference to reach conclusions or make decisions. However, several experiments show that a person's initial assessment of a problem may be incorrect, hence results need to be carefully analysed. One of the most common settings in which people employ reasoning is everyday language, where they exchange ideas and arguments about the problem to be solved. Another theory assumes that people rely on mental representations and mental models that correspond to imagined possibilities. Current research supports the notion that multiple approaches can be useful in modelling human thinking. These models assume that people tend to reason about problems by constructing multiple mental representations of a given situation. This process can help them identify relevant features and make inferences based on their understanding of the problem. Several models have been proposed to explain how, why and when reasoning develops from infancy to adulthood. Recent research suggests that early experiences and social interactions play a critical role in the development of reasoning abilities. For example, studies have shown that infants as young as six months old can engage in basic logical reasoning, such as reasoning about the relationship between objects and their properties. Parental interaction and cognitive stimulation are likewise considered important for the development of children's reasoning abilities.
Additionally, studies suggest that cultural factors, such as educational practices and the emphasis on critical thinking, can also influence the development of reasoning skills. In neuroscience research, studying reasoning involves measuring its neural correlates, using functional magnetic resonance imaging (fMRI) to monitor brain activity during the process of reasoning. With this rapidly growing discipline of research one can expect further insights into human reasoning.

A Definition of Machine-Based Reasoning

According to a study published by Microsoft Research, Machine Reasoning: Technology, Dilemma and Future, machine reasoning aims to build AI systems that can solve problems or draw conclusions from facts and observations. Although its 'formal' definitions vary across publications (McCarthy, 1958; Pearl, 1988; Khardon and Roth, 1994; Bottou, 2011; Bengio, 2019), machine reasoning methods share some commonalities. First, these systems are based on different types of knowledge, represented as logical rules, knowledge graphs, common sense or textual evidence. Second, they use different inference algorithms to manipulate the available knowledge for problem-solving. Third, they are capable of interpreting data to predict future events. Over the past decade the development of machine reasoning systems has gone through several stages. Symbolic reasoning methods represent knowledge using symbolic logic and perform inference with algorithms such as forward chaining and backward chaining. A major problem with this approach is that such methods cannot handle uncertainty in the data provided. Consequently, the rapid developments in deep learning and neural reasoning methods have attracted much attention from the AI research community. Neuro-symbolic reasoning methods represent knowledge symbols such as entities, relationships, actions, logical functions and formulas to allow the model to perform end-to-end learning. Neural evidence reasoning does not require the reasoning layer to be logical, so it can leverage both symbolic and non-symbolic evidence at the same time. Many of these principles are already successfully applied in real-world scenarios such as expert systems, medical diagnosis, knowledge-base completion, question answering, search engines and fact checking.
With further advances in knowledge engineering and in modelling neural-symbolic computing, it can be assumed that machine reasoning will improve beyond its current limitations.
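To make the symbolic approach concrete, here is a minimal sketch of forward chaining in Python: facts are plain strings, each rule maps a set of premises to a conclusion, and rules fire repeatedly until no new facts can be derived. The rule names (a toy medical-diagnosis example) are invented for illustration and are not taken from the Microsoft Research study.

```python
def forward_chain(facts, rules):
    """Derive all facts reachable from the initial facts.

    facts: a set of strings (known facts)
    rules: a list of (premises, conclusion) pairs, where premises
           is a set of fact strings and conclusion is a fact string
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires when all its premises are already known
            # and its conclusion is new.
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

# Hypothetical diagnosis rules, for illustration only.
rules = [
    ({"has_fever", "has_cough"}, "suspect_flu"),
    ({"suspect_flu", "is_high_risk"}, "recommend_test"),
]
derived = forward_chain({"has_fever", "has_cough", "is_high_risk"}, rules)
```

The sketch also illustrates the weakness the text mentions: every fact is simply true or false, so there is no way to express that a symptom is only probably present, which is why probabilistic and neural methods gained ground.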

How to Evaluate Abstract Reasoning in AI

According to an essay by Ben Dickson, Can GPT-4 and GPT-4V perform abstract reasoning like humans?, abstract reasoning in AI includes the capacity to discern a rule or pattern from sparse data and to extrapolate it to novel and unique scenarios. Evaluating abstract reasoning capabilities is difficult. One generally accepted measure is the Abstraction and Reasoning Corpus (ARC) developed by François Chollet, senior staff software engineer at Google. ARC is a framework for assessing abstract reasoning against the reasoning abilities of humans. The test comprises 1,000 handcrafted puzzles, each presenting an incomplete grid that the solver must fill in correctly. The purpose behind the design of these puzzles is to negate any unfair advantage, such as similarity to training data or reliance on extraneous knowledge. To solve them, one must apply abstract reasoning to a few demonstrations provided as a tutorial and use this knowledge to solve a new puzzle with the procedures presented. The foundational knowledge required to tackle ARC is believed to be similar to the human capacity for reasoning, encompassing concepts like object recognition, quantity assessment, and basic principles of geometry and topology. Human performance on ARC puzzles has been measured in multiple tests, reaching an average of 84%. In contrast, attempts to solve ARC with current AI systems have been quite disappointing, challenging researchers to improve their models.
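The structure of an ARC-style task can be sketched in a few lines of Python: a task provides demonstration input/output grid pairs, and a solver must infer the transformation from the demonstrations alone before applying it to a fresh test input. The grids and the candidate rule below ("mirror the grid horizontally") are invented for illustration and are far simpler than real ARC puzzles.

```python
def mirror(grid):
    """Candidate transformation: reflect each row left-to-right."""
    return [row[::-1] for row in grid]

# Demonstration pairs: (input grid, expected output grid).
# Cell values are small integers standing for colours, as in ARC.
demos = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 0, 0]], [[0, 0, 3]]),
]

# Only trust a candidate rule if it reproduces every demonstration.
candidate = mirror
consistent = all(candidate(inp) == out for inp, out in demos)

test_input = [[5, 0, 0], [0, 6, 0]]
prediction = candidate(test_input) if consistent else None
```

The hard part, of course, is not applying a rule but discovering it: ARC's handcrafted puzzles are designed so that the rule cannot be memorised from training data, which is precisely what makes the benchmark a test of abstraction.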

Testing GPT-4 on Reasoning Tasks

Interdisciplinary researchers from the Santa Fe Institute in New Mexico performed experiments with ConceptARC, a variant of ARC designed to be more accessible to human participants and to facilitate the assessment of specific conceptual ideas. To adapt ConceptARC to the text-based GPT-4, the visual puzzles were translated into character sequences through prompts. GPT-4's task was to generate a character sequence representing the solution to a new problem, with up to three attempts allowed. A previous test showed GPT-4 scoring 19% on ARC and 25% on ConceptARC. With the new and more comprehensive prompting technique, the results improved: tested across all 480 ConceptARC tasks, GPT-4's average performance reached approximately 33%. Despite this progress, GPT-4's performance remains well below that of humans, supporting the conclusion that, even with more informative prompting, the system lacks basic abstract reasoning abilities. The results support the hypothesis that GPT-4, perhaps the most capable 'general' LLM currently available, is still not able to robustly form abstractions and reason about data it has not previously encountered.
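Translating a visual puzzle into a character sequence can be done by serialising the grid row by row, in the spirit of the experiments described above (the exact encoding the Santa Fe researchers used may differ; this is a minimal sketch):

```python
def grid_to_text(grid):
    """Serialise a grid of small integers into lines of digits,
    one row per line, cells separated by spaces."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def text_to_grid(text):
    """Parse a model's character-sequence answer back into a grid."""
    return [[int(cell) for cell in line.split()] for line in text.splitlines()]

grid = [[0, 1, 0], [2, 2, 2]]
encoded = grid_to_text(grid)   # "0 1 0\n2 2 2" -- what the prompt would contain
decoded = text_to_grid(encoded)
```

A model's answer can then be scored by parsing it back into a grid and comparing it cell-for-cell with the expected output, with a task counted as solved if any of the three allowed attempts matches exactly.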

Conclusion: Towards the Convergence of Human and Machine Reasoning

Humans have always used tools for solving physical or mental problems. The wheel, the steam engine, the computer: each represented a leap in technology. Today's trajectory of advancements in AI points to a deepening symbiosis between humanity and its tools. As we stand on the brink of this transformative era, challenges and opportunities loom large. How will we uphold the dignity and essence of humanity in an age where our thoughts might seamlessly intertwine with algorithms? We are challenged to redefine what it means to be human in a world where humans and machines converge. Throughout history, new tools have served as milestones of progress, reflecting where we stood and, more importantly, guiding which path to follow. As AI grows in capability and performance, we are continually challenged to adapt our views about the nature of creativity, intelligence, and reasoning.
