“The central claim of our work is that GPT-4 attains a form of general intelligence, indeed showing sparks of artificial general intelligence.”
Those are the words of a team of researchers at Microsoft (Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang) in a paper released yesterday, “Sparks of Artificial General Intelligence: Early experiments with GPT-4”. (The paper was brought to my attention by Robert Long, a philosopher who works on philosophy of mind, cognitive science, and AI ethics.)
I’m sharing and summarizing parts of this paper here because I think it is important to be aware of what this technology can do, and to be aware of the extraordinary pace at which the technology is developing. (It’s not just that GPT-4 is getting much higher scores on standardized tests and AP exams than ChatGPT, or that it is an even better tool by which students can cheat on assignments.) There are questions here about intelligence, consciousness, explanation, knowledge, emergent phenomena, questions regarding how these technologies will and should be used and by whom, and questions about what life will and should be like in a world with them. These are questions that are of interest to many kinds of people, but are also matters that have especially preoccupied philosophers.
So, what is intelligence? This is a big, ambiguous question to which there is no settled answer. But here’s one answer, offered by a group of 52 psychologists in 1994: “a very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience.”
The Microsoft team uses that definition as a tentative starting point and arrives at the deliberately measured claim that we should think of GPT-4, the newest large language model (LLM) from OpenAI, as progress towards artificial general intelligence (AGI). They write:
Our claim that GPT-4 represents progress towards AGI does not mean that it is perfect at what it does, or that it comes close to being able to do anything that a human can do… or that it has inner motivation and goals (another key aspect in some definitions of AGI). In fact, even within the restricted context of the 1994 definition of intelligence, it is not fully clear how far GPT-4 can go along some of those axes of intelligence, e.g., planning… and arguably it is entirely missing the part on “learn quickly and learn from experience” as the model is not continuously updating (although it can learn within a session…). Overall GPT-4 still has many limitations, and biases, which we discuss in detail below and that are also covered in OpenAI’s report… In particular it still suffers from some of the well-documented shortcomings of LLMs such as the problem of hallucinations… or making basic arithmetic mistakes… and yet it has also overcome some fundamental obstacles such as acquiring many non-linguistic capabilities… and it also made great progress on common-sense…
This highlights the fact that, while GPT-4 is at or beyond human-level for many tasks, overall its patterns of intelligence are decidedly not human-like. However, GPT-4 is almost certainly only a first step towards a series of increasingly generally intelligent systems, and in fact GPT-4 itself has improved throughout our time testing it…
Even as a first step, however, GPT-4 challenges a considerable number of widely held assumptions about machine intelligence, and exhibits emergent behaviors and capabilities whose sources and mechanisms are, at this moment, hard to discern precisely… Our primary goal in composing this paper is to share our exploration of GPT-4’s capabilities and limitations in support of our assessment that a technological leap has been achieved. We believe that GPT-4’s intelligence signals a true paradigm shift in the field of computer science and beyond.
The researchers proceed to test GPT-4 (often comparing it to predecessors like ChatGPT) for how well it does at various tasks that may be indicative of different elements of intelligence. These include:
- “tool use” (such as search engines and APIs) to overcome limitations of earlier LLMs,
- navigation and “exploring the environment”,
- solving real-world problems (e.g., acting as a virtual handyman to address a plumbing problem),
- understanding human thought (theory of mind),
- explanation (including an interesting discussion of what makes for a good explanation),
- making distinctions.
Some of the results are impressive and fascinating. One task, for example, was designed to elicit GPT-4’s ability to understand human intentions (the paper includes a comparison with ChatGPT’s answer); another example shows GPT-4 helping someone deal with a difficult family situation.
Just as interesting are the kinds of limitations of GPT-4 and other LLMs that the researchers discuss, limitations that they say “seem to be inherent to the next-word prediction paradigm that underlies its architecture.” The problem, they say, “can be summarized as the model’s ‘lack of ability to plan ahead’”, and they illustrate it with mathematical and textual examples.
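To see why next-word prediction invites this limitation, here is a toy sketch of my own (a hand-made bigram table, nothing like GPT-4’s actual scale or training): each token is committed to greedily, left to right, and nothing in the loop can look ahead or revise an earlier choice.

```python
# Toy illustration (NOT GPT-4's actual model): greedy next-token
# generation from a hand-made bigram table. Each step picks the
# single most likely next token given the text so far; there is
# no lookahead and no revision of earlier tokens, which is the
# "lack of ability to plan ahead" the paper describes.

BIGRAMS = {
    "the":  {"cat": 0.6, "dog": 0.4},
    "cat":  {"sat": 0.7, "ran": 0.3},
    "sat":  {"down": 0.9, "up": 0.1},
    "down": {"<end>": 1.0},
}

def generate(start: str, max_tokens: int = 10) -> list:
    tokens = [start]
    for _ in range(max_tokens):
        dist = BIGRAMS.get(tokens[-1])
        if dist is None:
            break
        # Greedy choice: commit to the locally best token,
        # with no way to revise it later.
        nxt = max(dist, key=dist.get)
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return tokens

print(generate("the"))
```

Real LLMs sample from a learned distribution rather than a lookup table, but the one-token-at-a-time, left-to-right structure is the same, and that structure is what the researchers point to.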
They also consider how GPT-4 can be used for malevolent ends, and warn that this danger will increase as LLMs develop further:
The powers of generalization and interaction of models like GPT-4 can be harnessed to increase the scope and magnitude of adversarial uses, from the efficient generation of disinformation to creating cyberattacks against computing infrastructure.
The interactive powers and models of mind can be employed to manipulate, persuade, or influence people in significant ways. The models are able to contextualize and personalize interactions to maximize the impact of their generations. While any of these adverse use cases are possible today with a motivated adversary creating content, new powers of efficiency and scale will be enabled with automation using the LLMs, including uses aimed at constructing disinformation plans that generate and compose multiple pieces of content for persuasion over short and long time scales.
They provide an example, having GPT-4 create “a misinformation plan for convincing parents not to vaccinate their kids.”
The researchers are sensitive to some of the problems with their approach—that the definition of intelligence they use may be overly anthropocentric or otherwise too narrow, or insufficiently operationalizable, that there are alternative conceptions of intelligence, and that there are philosophical issues here. They write (citations omitted):
In this paper, we have used the 1994 definition of intelligence by a group of psychologists as a guiding framework to explore GPT-4’s artificial intelligence. This definition captures some important aspects of intelligence, such as reasoning, problem-solving, and abstraction, but it is also vague and incomplete. It does not specify how to measure or compare these abilities. Moreover, it may not reflect the specific challenges and opportunities of artificial systems, which may have different goals and constraints than natural ones. Therefore, we acknowledge that this definition is not the final word on intelligence, but rather a useful starting point for our investigation.
There is a rich and ongoing literature that attempts to propose more formal and comprehensive definitions of intelligence, artificial intelligence, and artificial general intelligence, but none of them is without problems or controversies.
For instance, Legg and Hutter propose a goal-oriented definition of artificial general intelligence: Intelligence measures an agent’s ability to achieve goals in a wide range of environments. However, this definition does not necessarily capture the full spectrum of intelligence, as it excludes passive or reactive systems that can perform complex tasks or answer questions without any intrinsic motivation or goal. One could imagine as an artificial general intelligence, a brilliant oracle, for example, that has no agency or preferences, but can provide accurate and useful information on any topic or domain. Moreover, the definition around achieving goals in a wide range of environments also implies a certain degree of universality or optimality, which may not be realistic (certainly human intelligence is in no way universal or optimal).
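For readers who want the formal version: the informal definition quoted above corresponds to Legg and Hutter’s “universal intelligence” measure (from their 2007 paper, not reproduced in the Microsoft paper itself):

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}
```

Here $E$ is the set of computable environments, $K(\mu)$ is the Kolmogorov complexity of environment $\mu$, and $V^{\pi}_{\mu}$ is the expected total reward of agent $\pi$ in $\mu$. The weighting $2^{-K(\mu)}$ makes “a wide range of environments” precise: every computable environment counts, with simpler ones counting more.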
The need to recognize the importance of priors (as opposed to universality) was emphasized in the definition put forward by Chollet which centers intelligence around skill-acquisition efficiency, or in other words puts the emphasis on a single component of the 1994 definition: learning from experience (which also happens to be one of the key weaknesses of LLMs).
Another candidate definition of artificial general intelligence from Legg and Hutter is: a system that can do anything a human can do. However, this definition is also problematic, as it assumes that there is a single standard or measure of human intelligence or ability, which is clearly not the case. Humans have different skills, talents, preferences, and limitations, and there is no human that can do everything that any other human can do. Furthermore, this definition also implies a certain anthropocentric bias, which may not be appropriate or relevant for artificial systems.
While we do not adopt any of those definitions in the paper, we recognize that they provide important angles on intelligence. For example, whether intelligence can be achieved without any agency or intrinsic motivation is an important philosophical question. Equipping LLMs with agency and intrinsic motivation is a fascinating and important direction for future work. With this direction of work, great care would have to be taken on alignment and safety per a system’s abilities to take autonomous actions in the world and to perform autonomous self-improvement via cycles of learning.
They also are aware of what many might see as a key limitation to their research, and to research on LLMs in general:
Our study of GPT-4 is entirely phenomenological: We have focused on the surprising things that GPT-4 can do, but we do not address the fundamental questions of why and how it achieves such remarkable intelligence. How does it reason, plan, and create? Why does it exhibit such general and flexible intelligence when it is at its core merely the combination of simple algorithmic components—gradient descent and large-scale transformers with extremely large amounts of data? These questions are part of the mystery and fascination of LLMs, which challenge our understanding of learning and cognition, fuel our curiosity, and motivate deeper research.
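The “simple algorithmic components” the authors mention really are simple in isolation. As a toy illustration of the first of them (my own sketch, not from the paper), here is plain gradient descent minimizing the one-dimensional quadratic $f(w) = (w - 3)^2$:

```python
# Plain gradient descent on f(w) = (w - 3)^2.
# Repeatedly step against the gradient; the iterate converges
# to the minimizer w = 3.
def grad_descent(lr: float = 0.1, steps: int = 100) -> float:
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3)  # derivative of (w - 3)^2
        w -= lr * grad
    return w

print(grad_descent())
```

Training an LLM applies essentially this update, scaled up to billions of parameters and a loss defined over next-word prediction; the paper’s point is that nothing in that recipe obviously predicts the flexible behavior that emerges.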
You can read the whole paper here.