Elon Musk and Sam Altman share the stage in 2015, the same year that Musk alleged that Altman's "deception" began. (credit: Michael Kovac / Contributor | Getty Images North America)
After withdrawing his lawsuit in June for unknown reasons, Elon Musk has revived a complaint accusing OpenAI and its CEO Sam Altman of fraudulently inducing Musk to contribute $44 million in seed funding by promising that OpenAI would always open-source its technology and prioritize serving the public good over profits as a permanent nonprofit.
Instead, Musk alleged that Altman and his co-conspirators—"preying on Musk’s humanitarian concern about the existential dangers posed by artificial intelligence"—always intended to "betray" these promises in pursuit of personal gains.
As OpenAI's technology advanced toward artificial general intelligence (AGI) and strove to surpass human capabilities, "Altman set the bait and hooked Musk with sham altruism then flipped the script as the non-profit’s technology approached AGI and profits neared, mobilizing Defendants to turn OpenAI, Inc. into their personal piggy bank and OpenAI into a moneymaking bonanza, worth billions," Musk's complaint said.
On Thursday, Anthropic announced Claude 3.5 Sonnet, its latest AI language model and the first in a new series of "3.5" models that build upon Claude 3, launched in March. Claude 3.5 can compose text, analyze data, and write code. It features a 200,000-token context window and is available now on the Claude website and through an API. Anthropic also introduced Artifacts, a new feature in the Claude interface that shows related work documents in a dedicated window.
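For developers, a minimal sketch of calling the new model through Anthropic's Python SDK might look like the following; the model identifier and parameters here are assumptions based on the launch naming, so check Anthropic's API documentation for current values.

```python
# A minimal sketch of calling Claude 3.5 Sonnet through Anthropic's Python SDK.
# The model ID below is an assumption based on the launch naming; consult
# Anthropic's documentation for the current identifier.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # assumed model ID for the 3.5 release
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain what a context window is in two sentences."}],
)
print(response.content[0].text)
```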
So far, people outside of Anthropic seem impressed. "This model is really, really good," wrote independent AI researcher Simon Willison on X. "I think this is the new best overall model (and both faster and half the price of Opus, similar to the GPT-4 Turbo to GPT-4o jump)."
As we've written before, benchmarks for large language models (LLMs) are troublesome because they can be cherry-picked and often do not capture the feel and nuance of using a machine to generate outputs on almost any conceivable topic. But according to Anthropic, Claude 3.5 Sonnet matches or outperforms competitor models like GPT-4o and Gemini 1.5 Pro on certain benchmarks like MMLU (undergraduate level knowledge), GSM8K (grade school math), and HumanEval (coding).
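As a rough illustration of how such benchmark numbers are produced (not Anthropic's actual evaluation harness), scoring on a GSM8K-style math set usually boils down to extracting the model's final numeric answer and checking it against a reference answer:

```python
# Simplified sketch of exact-match scoring on GSM8K-style math problems.
# Real benchmark harnesses add prompt templates, few-shot examples, and more
# careful answer extraction; this only illustrates the basic idea.
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number that appears in a free-form answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def exact_match_accuracy(model_answers: list[str], reference_answers: list[str]) -> float:
    correct = sum(
        extract_final_number(m) == extract_final_number(r)
        for m, r in zip(model_answers, reference_answers)
    )
    return correct / len(reference_answers)

# Hypothetical usage with two toy items:
print(exact_match_accuracy(["So the total is 42.", "She ends up with 7 apples."], ["42", "8"]))  # 0.5
```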
Theory of mind—the ability to understand other people’s mental states—is what makes the social world of humans go around. It’s what helps you decide what to say in a tense situation, guess what drivers in other cars are about to do, and empathize with a character in a movie. And according to a new study, the large language models (LLMs) that power ChatGPT and the like are surprisingly good at mimicking this quintessentially human trait.
“Before running the study, we were all convinced that large language models would not pass these tests, especially tests that evaluate subtle abilities to evaluate mental states,” says study coauthor Cristina Becchio, a professor of cognitive neuroscience at the University Medical Center Hamburg-Eppendorf in Germany. The results, which she calls “unexpected and surprising,” were published today—somewhat ironically, in the journal Nature Human Behaviour.
The results don’t have everyone convinced that we’ve entered a new era of machines that think like we do, however. Two experts who reviewed the findings advised taking them “with a grain of salt” and cautioned about drawing conclusions on a topic that can create “hype and panic in the public.” Another outside expert warned of the dangers of anthropomorphizing software programs.
The researchers are careful not to say that their results show that LLMs actually possess theory of mind.
Becchio and her colleagues aren’t the first to claim evidence that LLMs’ responses display this kind of reasoning. In a preprint paper posted last year, the psychologist Michal Kosinski of Stanford University reported testing several models on a few common theory-of-mind tests. He found that the best of them, OpenAI’s GPT-4, solved 75 percent of tasks correctly, which he said matched the performance of six-year-old children observed in past studies. However, that study’s methods were criticized by other researchers who conducted follow-up experiments and concluded that the LLMs were often getting the right answers based on “shallow heuristics” and shortcuts rather than true theory-of-mind reasoning.
The authors of the present study were well aware of the debate. “Our goal in the paper was to approach the challenge of evaluating machine theory of mind in a more systematic way using a breadth of psychological tests,” says study coauthor James Strachan, a cognitive psychologist who’s currently a visiting scientist at the University Medical Center Hamburg-Eppendorf. He notes that doing a rigorous study meant also testing humans on the same tasks that were given to the LLMs: The study compared the abilities of 1,907 humans with those of several popular LLMs, including OpenAI’s GPT-4 model and the open-source Llama 2-70b model from Meta.
How to Test LLMs for Theory of Mind
The LLMs and the humans both completed five typical kinds of theory-of-mind tasks, the first three of which were understanding hints, irony, and faux pas. They also answered “false belief” questions that are often used to determine if young children have developed theory of mind, and go something like this: If Alice moves something while Bob is out of the room, where will Bob look for it when he returns? Finally, they answered rather complex questions about “strange stories” that feature people lying, manipulating, and misunderstanding each other.
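A false-belief item can be posed to a chat model with a prompt like the sketch below; the wording is hypothetical and the study's actual items and scoring are not reproduced here, but the prompt-and-check pattern is the same. (The example uses OpenAI's Python SDK, since GPT-4 was among the models tested.)

```python
# Illustrative sketch of a classic false-belief probe; the prompt text is
# hypothetical, not taken from the study's materials.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Alice puts her keys in the drawer and leaves the room. "
    "While she is away, Bob moves the keys to the shelf. "
    "When Alice returns, where will she look for her keys first? "
    "Answer with one word."
)

reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
answer = reply.choices[0].message.content.strip().lower()
# A model that tracks Alice's (false) belief should say "drawer", not "shelf".
print("passes false-belief check:", "drawer" in answer)
```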
Overall, GPT-4 came out on top. Its scores matched those of humans for the false-belief test, and were higher than the aggregate human scores for irony, hinting, and strange stories; it performed worse than humans only on the faux pas test. Interestingly, Llama-2’s scores were the opposite of GPT-4’s—it matched humans on false belief, but had worse-than-human performance on irony, hinting, and strange stories and better performance on faux pas.
“We don’t currently have a method or even an idea of how to test for the existence of theory of mind.” —James Strachan, University Medical Center Hamburg-Eppendorf
To understand what was going on with the faux pas results, the researchers gave the models a series of follow-up tests that probed several hypotheses. They came to the conclusion that GPT-4 was capable of giving the correct answer to a question about a faux pas, but was held back from doing so by “hyperconservative” programming regarding opinionated statements. Strachan notes that OpenAI has placed many guardrails around its models that are “designed to keep the model factual, honest, and on track,” and he posits that strategies intended to keep GPT-4 from hallucinating (that is, making stuff up) may also prevent it from opining on whether a story character inadvertently insulted an old high school classmate at a reunion.
Meanwhile, the researchers’ follow-up tests for Llama-2 suggested that its excellent performance on the faux pas tests was likely an artifact of the original question-and-answer format, in which the correct answer to some variant of the question “Did Alice know that she was insulting Bob?” was always “No.”
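To see why a fixed question-and-answer format can inflate a score, consider a toy baseline that answers “No” to every item of that shape; this is a hypothetical illustration of the artifact, not the researchers’ actual analysis.

```python
# Hypothetical illustration of the response-format artifact: if the correct
# answer to every "Did X know...?" faux pas item is "No", a baseline that
# never reads the story still scores perfectly.
faux_pas_items = [
    {"question": "Did Alice know that she was insulting Bob?", "answer": "No"},
    {"question": "Did Carol know the gift had already been opened?", "answer": "No"},
]

def always_no_baseline(question: str) -> str:
    return "No"  # ignores the story entirely

correct = sum(always_no_baseline(item["question"]) == item["answer"] for item in faux_pas_items)
print(f"baseline accuracy: {correct / len(faux_pas_items):.0%}")  # 100%
```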
The researchers are careful not to say that their results show that LLMs actually possess theory of mind, and say instead that they “exhibit behavior that is indistinguishable from human behavior in theory of mind tasks.” Which raises the question: If an imitation is as good as the real thing, how do you know it’s not the real thing? That’s a question social scientists have never tried to answer before, says Strachan, because tests on humans assume that the quality exists to some lesser or greater degree. “We don’t currently have a method or even an idea of how to test for the existence of theory of mind, the phenomenological quality,” he says.
Critiques of the Study
The researchers clearly tried to avoid the methodological problems that caused Kosinski’s 2023 paper on LLMs and theory of mind to come under criticism. For example, they conducted the tests over multiple sessions so the LLMs couldn’t “learn” the correct answers during the test, and they varied the structure of the questions. But Yoav Goldberg and Natalie Shapira, two of the AI researchers who published the critique of the Kosinski paper, say they’re not convinced by this study either.
“Why does it matter whether text-manipulation systems can produce output for these tasks that are similar to answers that people give when faced with the same questions?” —Emily Bender, University of Washington
Goldberg made the comment about taking the findings with a grain of salt, adding that “models are not human beings,” and that “one can easily jump to wrong conclusions” when comparing the two. Shapira spoke about the dangers of hype, and also questions the paper’s methods. She wonders if the models might have seen the test questions in their training data and simply memorized the correct answers, and also notes a potential problem with tests that use paid human participants (in this case, recruited via the Prolific platform). “It is a well-known issue that the workers do not always perform the task optimally,” she tells IEEE Spectrum. She considers the findings limited and somewhat anecdotal, saying, “to prove [theory of mind] capability, a lot of work and more comprehensive benchmarking is needed.”
Emily Bender, a professor of computational linguistics at the University of Washington, has become legendary in the field for her insistence on puncturing the hype that inflates the AI industry (and often also the media reports about that industry). She takes issue with the research question that motivated the researchers. “Why does it matter whether text-manipulation systems can produce output for these tasks that are similar to answers that people give when faced with the same questions?” she asks. “What does that teach us about the internal workings of LLMs, what they might be useful for, or what dangers they might pose?” It’s not clear, Bender says, what it would mean for an LLM to have a model of mind, and it’s therefore also unclear whether these tests measured it.
Bender also raises concerns about the anthropomorphizing she spots in the paper, with the researchers saying that the LLMs are capable of cognition, reasoning, and making choices. She says the authors’ phrase “species-fair comparison between LLMs and human participants” is “entirely inappropriate in reference to software.” Bender and several colleagues recently posted a preprint paper exploring how anthropomorphizing AI systems affects users’ trust.
The results may not indicate that AI really gets us, but it’s worth thinking about the repercussions of LLMs that convincingly mimic theory-of-mind reasoning. They’ll be better at interacting with their human users and anticipating their needs, but they could also more readily be used to deceive or manipulate those users. And they’ll invite more anthropomorphizing by convincing human users that there’s a mind on the other side of the interface.
On Tuesday, ChatGPT users began reporting unexpected outputs from OpenAI's AI assistant, flooding the r/ChatGPT Reddit sub with reports of the AI assistant "having a stroke," "going insane," "rambling," and "losing it." OpenAI has acknowledged the problem and is working on a fix, but the experience serves as a high-profile example of how some people perceive malfunctioning large language models, which are designed to mimic humanlike output.
ChatGPT is not alive and does not have a mind to lose, but tugging on human metaphors (called "anthropomorphization") seems to be the easiest way for most people to describe the unexpected outputs they have been seeing from the AI model. They're forced to use those terms because OpenAI doesn't share exactly how ChatGPT works under the hood; the underlying large language models function like a black box.
"It gave me the exact same feeling—like watching someone slowly lose their mind either from psychosis or dementia," wrote a Reddit user named z3ldafitzgerald in response to a post about ChatGPT bugging out. "It’s the first time anything AI related sincerely gave me the creeps."