Can we Grade AI and Compare it with Human Intelligence?

Posted by Peter Rudin on 13. December 2019 in Essay

Picture Credit: ‘The Bridge’ by Francois Chollet (Google), founder of Keras


There is no agreed definition or model of intelligence. According to the Collins English Dictionary, intelligence is ‘the ability to think, reason, and understand instead of doing things automatically or by instinct’. The Macmillan Dictionary defines it as ‘the ability to understand and think about things, and to gain and use knowledge’. We tend to think of intelligence in terms of analytical skills, but it might equally be defined in terms of search or social skills. The skills that are most valued have changed over time. In the West, the emphasis has gradually shifted from language skills to more purely analytical skills; it was only in 1960 that the Universities of Oxford and Cambridge dropped Latin as an entry requirement. In 1990, Peter Salovey and John Mayer published the seminal paper on emotional intelligence, and EQ quickly became as important as IQ.

With the resurgence of AI in the late nineties and the application of Deep Learning to solve specific cognitive problems, the question of what constitutes intelligence has gained new momentum. How can one create intelligence artificially if we have no precise definition of what intelligence is? Advancing current Narrow AI (NAI) towards Artificial General Intelligence (AGI) raises the question of how to measure and compare intelligence between machines and humans. After hundreds of years of philosophical debate, realizing AGI might finally settle the issue of defining cognitive intelligence.

Conversely, this result would also raise our awareness that emotional, social and other forms of intelligence are as vital to humans as cognition.

Problems with the Turing Test 

Alan Turing contributed essential insights to the field of AI. In 1950 he asked the question “Can machines think?”, which he then replaced by a question that was easier to test: “Can a machine simulate human thinking, in such a way that a human judge cannot distinguish that machine from a human?”. According to his test, a human judge should be unable to distinguish the machine from another human when evaluating its answers to questions. In 2014, a chatbot named ‘Eugene Goostman’ passed the Turing Test by fooling 33% of the human judges into believing that it was human. Some argued that the test was not valid, as pretending to be a 13-year-old Ukrainian boy is far easier than simulating a native English-speaking adult. By pretending to be a young foreign child, the chatbot could hide behind grammatical mistakes and a limited vocabulary. The Turing Test and its many variants (e.g. the Total Turing Test or the Loebner Prize) are not useful for measuring progress in AI. Instead of defining and measuring intelligence, these tests outsource the task to unreliable human judges who themselves have no clear definitions or evaluation standards.

Grading the intelligence of machines

Just as employees can be assessed with an IQ test, it seems reasonable to test the IQ of the AI services provided by tech giants like Amazon, Google, Apple, Baidu or Microsoft.

Comparing these companies’ existing AI systems – and seeing how they stack up against human beings – has been difficult. In a paper published by a trio of Chinese researchers, including Yong Shi, executive deputy director of the Chinese Academy of Sciences, the researchers present the results of a test based on what they call the ‘standard intelligence model’. Under this model, an AI system must have a way of obtaining data from the outside world; it must be able to transform the data into a form that it can process; it must be able to use this knowledge in an innovative way; and finally, it must feed the resultant knowledge back into the outside world. The test the researchers developed measures a system’s ability to handle these tasks, much like an IQ test, and applying the same principles to every system allowed them to grade the systems on a single scale. According to their studies conducted in 2016, Google’s AI had an IQ of 47.28. It came out ahead of the Chinese search engine Baidu (32.92) and Microsoft’s Bing (31.98), and had almost double the IQ of Siri (23.94). Notably, none of these systems had a higher IQ than a 6-year-old child (55.5), let alone an 18-year-old (97).
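The four capabilities that the ‘standard intelligence model’ grades can be made concrete with a small sketch. This is purely illustrative and not the researchers’ code; the interface names and the equal-weighted aggregation below are my own assumptions, since the paper’s actual scoring scheme is not reproduced here:

```python
from abc import ABC, abstractmethod

class StandardIntelligenceModel(ABC):
    """Hypothetical interface for the four graded capabilities."""

    @abstractmethod
    def obtain(self, raw):        # 1. obtain data from the outside world
        ...

    @abstractmethod
    def master(self, data):       # 2. transform the data into a processable form
        ...

    @abstractmethod
    def create(self, knowledge):  # 3. use that knowledge in an innovative way
        ...

    @abstractmethod
    def feed_back(self, result):  # 4. feed resultant knowledge back to the world
        ...

def grade(scores):
    """Naive equal-weighted aggregate of four per-capability scores.
    (Illustration only: the paper's real weighting is not shown here.)"""
    assert set(scores) == {"obtain", "master", "create", "feed_back"}
    return sum(scores.values()) / len(scores)
```

The point of the sketch is the shape of the evaluation, not the numbers: every system is probed on the same four capabilities, so their scores become comparable on one scale.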

Towards a new perspective of defining and evaluating intelligence

The goal of AGI (Artificial General Intelligence) is to overcome the limits of NAI (Narrow Artificial Intelligence) experienced in applications such as image recognition or event prediction. Getting there requires a thorough understanding of cognitive intelligence. Since intelligence is measured against a specific application, two intelligent systems can only be meaningfully compared if they share similar prerequisites. As such, AGI should be benchmarked against human intelligence and should build on a similar set of previously acquired knowledge (e.g. Core Knowledge). There are currently two opposing views on how to approach this:

  • One view holds that the mind is a relatively static assembly of special-purpose mechanisms developed by evolution, only capable of learning what it is programmed to acquire.
  • The other holds that the mind is a general-purpose processor, capable of turning arbitrary experience into knowledge and skills that can be directed at any problem.

In other words, intelligence can be defined in two ways: one with an emphasis on task-specific skills, achieving a predefined goal, and another focused on generality and adaptation in a wide range of environments.

Intelligence as a collection of task-specific skills

The evolutionary psychology view of human nature is that much of human cognitive function is the result of special-purpose adaptations that arose to solve specific problems encountered by humans throughout their evolution – an idea which originated with Darwin. This vision of the mind as a wide collection of vertical, relatively static programs that collectively define intelligence was endorsed by the influential AI pioneer Marvin Minsky (see also ‘The Society of Mind’). His view gave rise to definitions of intelligence that focus on task-specific performance, perhaps best illustrated by Minsky’s 1968 definition of AI: “AI is the science of making machines capable of performing tasks that would require intelligence if done by humans”. Following this statement, the AI community widely accepted the view that the “problem of intelligence” would be solved if only we could encode human skills into formal rules and human knowledge into explicit databases.

Intelligence as a general learning ability

In contrast to Minsky’s views, a number of researchers have taken the position that intelligence lies in the general ability to acquire new skills through learning – an ability that can be directed at a wide range of previously unknown problems. The notion that machines could acquire new skills through a learning process similar to that of human children was first presented by Turing in his 1950 paper. In 1958, Richard M. Friedberg, a theoretical physicist who has contributed solutions to a variety of problems in mathematics and physics, noted: “If we are ever to make a machine that will speak, understand or translate human languages, solve mathematical problems with imagination, practice a profession or direct an organization, either we must reduce these activities to a science so exact that we can tell a machine precisely how to go about doing them or we must develop a machine that can do things without being told precisely how”.

A unified approach to measure intelligence of humans and machines

Today, it is increasingly apparent that neither view of human intelligence – as a collection of task-specific skills or as a general learning ability – is sufficient to achieve AGI. To build more intelligent and more human-like artificial systems, we need to define and evaluate intelligence in a way that enables comparisons between systems as well as comparisons with humans. In his new scientific paper ‘On the Measure of Intelligence’, Francois Chollet, creator of Google’s Keras software, articulates a new formal definition of intelligence based on Algorithmic Information Theory. He defines intelligence as ‘skill-acquisition efficiency’, highlighting the roles of scope, prior (core) knowledge and experience. Using this definition, he proposes a set of guidelines for what a general AI benchmark should look like and introduces the ‘Abstraction and Reasoning Corpus (ARC)’. Built upon an explicit set of priors, ARC is designed to stay as close as possible to innate human core knowledge. Unlike existing IQ tests, ARC is not interested in assessing crystallized (fixed, knowledge-based) intelligence or crystallized cognitive abilities; it assesses only a general form of fluid intelligence, with a focus on reasoning and abstraction. ARC does not involve language, pictures of real-world objects or real-world common sense; it seeks to involve only knowledge that stays close to Core Knowledge and shared experience. According to Chollet, ARC can be used to measure a human-like form of general fluid intelligence, enabling fair general-intelligence comparisons between AI systems and humans.
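To make the ARC idea tangible: each ARC task is a handful of input/output grid pairs, and the solver must infer the underlying transformation from the demonstrations alone, with no task-specific training. The toy task below (a horizontal mirror) is invented for illustration and is not taken from the actual ARC dataset, though it mimics its train/test pair structure:

```python
# A toy ARC-style task: grids of small integers, a few demonstration pairs
# ("train") and a held-out pair ("test"). Invented example, not real ARC data.
task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[5, 6, 7]],      "output": [[7, 6, 5]]},
    ],
    "test": [
        {"input": [[4, 0, 9]],      "output": [[9, 0, 4]]},
    ],
}

def mirror(grid):
    """Candidate program: reflect each row left-to-right."""
    return [list(reversed(row)) for row in grid]

def solves(program, task):
    """Check a candidate program against every demonstration and test pair."""
    pairs = task["train"] + task["test"]
    return all(program(p["input"]) == p["output"] for p in pairs)
```

A solver that has only core-knowledge priors (objectness, symmetry, counting) and two demonstrations must still produce the right answer on the test grid – that gap between minimal experience and acquired skill is exactly what Chollet’s definition tries to measure.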


With the resurgence of machine learning in the 1980s, its rise to intellectual dominance in the 2000s, and its peak as a quasi-monopoly in AI in the late 2010s, Deep Learning has become the dominant philosophical framework in which AI research takes place. Many researchers implicitly conceptualize the mind via the metaphor of a “randomly initialized neural network” that starts blank and derives its skills from “training data”. The fact that this approach leads to black-box results which cannot be explained is driving the effort to develop human-centred AI and to move away from a purely data-centric approach.

Applying a framework like Chollet’s ‘Abstraction and Reasoning Corpus (ARC)’ is likely to empower AGI research with a new perspective on defining and evaluating intelligence, structured around the following ideas:

  • Intelligence is the efficiency with which a learning system turns experience and already existing knowledge into skill to solve previously unknown tasks.
  • All intelligence is relative to a scope of application. Two intelligent systems may only be meaningfully compared within a shared scope and if they share similar priors such as core knowledge and experience.

  • Human and machine intelligence can be compared based on the same principles and concepts.
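The first of these ideas can be given a toy numeric reading. This is a deliberate simplification of Chollet’s formal definition, which quantifies priors and experience in algorithmic-information-theoretic terms; the function and the numbers below are invented purely to show the direction of the measure:

```python
def skill_acquisition_efficiency(skill, priors, experience):
    """Toy reading of 'intelligence as skill-acquisition efficiency':
    skill attained per unit of prior knowledge plus experience consumed.
    (Chollet's actual definition measures these quantities in
    algorithmic-information terms; this is only an illustration.)"""
    return skill / (priors + experience)

# Two systems reaching the same skill level; the one that needed
# less experience to get there rates as more intelligent:
fast_learner = skill_acquisition_efficiency(skill=0.9, priors=1.0, experience=2.0)
slow_learner = skill_acquisition_efficiency(skill=0.9, priors=1.0, experience=8.0)
```

The design point survives the simplification: raw task performance alone says nothing about intelligence; what matters is how little prior knowledge and experience were spent to reach that performance.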

Research measuring AI performance against human intelligence will eventually advance AGI to the point where artificial cognitive intelligence becomes a commodity and a tool, providing intelligence as a service. Humans have benefitted greatly from the introduction of machines that enhance or replace muscle power. With a positive mindset we will learn how to augment our own intelligence and brain power with AGI, taking another step in advancing human evolution.
