Tech companies have been caught up in a race to build the biggest large language models (LLMs). In April, for example, Meta announced the 400-billion-parameter Llama 3, which has twice as many parameters (the variables that determine how the model responds to queries) as OpenAI’s original ChatGPT model from 2022. Although not confirmed, GPT-4 is estimated to have about 1.8 trillion parameters.
In the last few months, however, some of the largest tech companies, including Apple and Microsoft, have introduced small language models (SLMs). These models are a fraction of the size of their LLM counterparts and yet, on many benchmarks, can match or even outperform them in text generation.
On 10 June, at Apple’s Worldwide Developers Conference, the company announced its “Apple Intelligence” models, which have around 3 billion parameters. And in late April, Microsoft released its Phi-3 family of SLMs, featuring models housing between 3.8 billion and 14 billion parameters.
In a series of tests, the smallest of Microsoft’s models, Phi-3-mini, rivalled OpenAI’s GPT-3.5 (175 billion parameters), which powers the free version of ChatGPT, and outperformed Google’s Gemma (7 billion parameters). The tests evaluated how well a model understands language by prompting it with questions about mathematics, philosophy, law, and more. What’s more interesting, Microsoft’s Phi-3-small, with 7 billion parameters, fared remarkably better than GPT-3.5 in many of these benchmarks.
Aaron Mueller, who researches language models at Northeastern University in Boston, isn’t surprised SLMs can go toe-to-toe with LLMs in select functions. He says that’s because scaling the number of parameters isn’t the only way to improve a model’s performance: Training it on higher-quality data can yield similar results too.
Microsoft’s Phi models were trained on fine-tuned “textbook-quality” data, says Mueller, which have a more consistent style that’s easier to learn from than the highly diverse text from across the Internet that LLMs typically rely on. Similarly, Apple trained its SLMs exclusively on richer and more complex datasets.
The rise of SLMs comes at a time when the performance gap between LLMs is quickly narrowing and tech companies look to deviate from standard scaling laws and explore other avenues for performance upgrades. At an event in April, OpenAI’s CEO Sam Altman said he believes we’re at the end of the era of giant models. “We’ll make them better in other ways.”
Because SLMs don’t consume nearly as much energy as LLMs, they can also run locally on devices like smartphones and laptops (instead of in the cloud), which preserves data privacy and allows the models to be personalized to each user. In March, Google rolled out Gemini Nano to the company’s Pixel line of smartphones. The SLM can summarize audio recordings and produce smart replies to conversations without an Internet connection. Apple is expected to follow suit later this year.
More importantly, SLMs can democratize access to language models, says Mueller. So far, AI development has been concentrated in the hands of a couple of large companies that can afford to deploy high-end infrastructure, while other, smaller operations and labs have been forced to license their models for hefty fees.
Since SLMs can be easily trained on more affordable hardware, says Mueller, they’re more accessible to those with modest resources and yet still capable enough for specific applications.
In addition, while researchers agree there’s still a lot of work ahead to overcome hallucinations, carefully curated SLMs bring them a step closer toward building responsible AI that is also interpretable, which would potentially allow researchers to debug specific LLM issues and fix them at the source.
For researchers like Alex Warstadt, a computer science researcher at ETH Zurich, SLMs could also offer new, fascinating insights into a longstanding scientific question: How children acquire their first language. Warstadt, alongside a group of researchers including Northeastern’s Mueller, organizes BabyLM, a challenge in which participants optimize language-model training on small data.
Not only could SLMs potentially unlock new secrets of human cognition, but they could also help improve generative AI. By the time children turn 13, they’ve been exposed to about 100 million words and are better than chatbots at language, with access to only 0.01 percent of the data. While no one knows what makes humans so much more efficient, says Warstadt, “reverse engineering efficient humanlike learning at small scales could lead to huge improvements when scaled up to LLM scales.”
Last month, when Google introduced its new AI search tool, called AI Overviews, the company seemed confident that it had tested the tool sufficiently, noting in the announcement that “people have already used AI Overviews billions of times through our experiment in Search Labs.” The tool doesn’t just return links to Web pages, as in a typical Google search, but returns an answer that it has generated based on various sources, which it links to below the answer. But immediately after the launch, users began posting examples of extremely wrong answers, including a pizza recipe that included glue and the interesting fact that a dog has played in the NBA.
While the pizza recipe is unlikely to convince anyone to squeeze on the Elmer’s, not all of AI Overviews’ extremely wrong answers are so obvious, and some have the potential to be quite harmful. Renée DiResta has been tracking online misinformation for many years as the technical research manager at Stanford’s Internet Observatory and has a new book out about the online propagandists who “turn lies into reality.” She has studied the spread of medical misinformation via social media, so IEEE Spectrum spoke to her about whether AI search is likely to bring an onslaught of erroneous medical advice to unwary users.
I know you’ve been tracking disinformation on the Web for many years. Do you expect the introduction of AI-augmented search tools like Google’s AI Overviews to make the situation worse or better?
Renée DiResta: It’s a really interesting question. There are a couple of policies that Google has had in place for a long time that appear to be in tension with what’s coming out of AI-generated search. That’s made me feel like part of this is Google trying to keep up with where the market has gone. There’s been an incredible acceleration in the release of generative AI tools, and we are seeing Big Tech incumbents trying to make sure that they stay competitive. I think that’s one of the things that’s happening here.
We have long known that hallucinations are a thing that happens with large language models. That’s not new. It’s the deployment of them in a search capacity that I think has been rushed and ill-considered because people expect search engines to give them authoritative information. That’s the expectation you have on search, whereas you might not have that expectation on social media.
There are plenty of examples of comically poor results from AI search, things like how many rocks we should eat per day [a response that was drawn from an Onion article]. But I’m wondering if we should be worried about more serious medical misinformation. I came across one blog post about Google’s AI Overviews responses about stem-cell treatments. The problem there seemed to be that the AI search tool was sourcing its answers from disreputable clinics that were offering unproven treatments. Have you seen other examples of that kind of thing?
DiResta: I have. It’s returning information synthesized from the data that it’s trained on. The problem is that it does not seem to be adhering to the same standards that have long gone into how Google thinks about returning search results for health information. So what I mean by that is Google has, for upwards of 10 years at this point, had a search policy called Your Money or Your Life. Are you familiar with that?
I don’t think so.
DiResta: Your Money or Your Life acknowledges that for queries related to finance and health, Google has a responsibility to hold search results to a very high standard of care, and it’s paramount to get the information correct. People are coming to Google with sensitive questions and they’re looking for information to make materially impactful decisions about their lives. They’re not there for entertainment when they’re asking a question about how to respond to a new cancer diagnosis, for example, or what sort of retirement plan they should be subscribing to. So you don’t want content farms and random Reddit posts and garbage to be the results that are returned. You want to have reputable search results.
That framework of Your Money or Your Life has informed Google’s work on these high-stakes topics for quite some time. And that’s why I think it’s disturbing for people to see the AI-generated search results regurgitating clearly wrong health information from low-quality sites that perhaps happened to be in the training data.
So it seems like AI Overviews is not following that same policy, or at least that’s how it appears from the outside?
DiResta: That’s how it appears from the outside. I don’t know how they’re thinking about it internally. But those screenshots you’re seeing—a lot of these instances are being traced back to an isolated social media post or a clinic that’s disreputable but exists—are out there on the Internet. It’s not simply making things up. But it’s also not returning what we would consider to be a high-quality result in formulating its response.
I saw that Google responded to some of the problems with a blog post saying that it is aware of these poor results and it’s trying to make improvements. And I can read you the one bullet point that addressed health. It said, “For topics like news and health, we already have strong guardrails in place. In the case of health, we launched additional triggering refinements to enhance our quality protections.” Do you know what that means?
DiResta: That blog post is an explanation that [AI Overviews] isn’t simply hallucinating. The fact that it’s pointing to URLs is supposed to be a guardrail because that enables the user to go and follow the result to its source. This is a good thing. They should be including those sources for transparency and so that outsiders can review them. However, it is also a fair bit of onus to put on the audience, given the trust that Google has built up over time by returning high-quality results in its health information search rankings.
I know one topic that you’ve tracked over the years has been disinformation about vaccine safety. Have you seen any evidence of that kind of disinformation making its way into AI search?
DiResta: I haven’t, though I imagine outside research teams are now testing results to see what appears. Vaccines have been so much a focus of the conversation around health misinformation for quite some time, I imagine that Google has had people looking specifically at that topic in internal reviews, whereas some of these other topics might be less in the forefront of the minds of the quality teams that are tasked with checking if there are bad results being returned.
What do you think Google’s next moves should be to prevent medical misinformation in AI search?
DiResta: Google has a perfectly good policy to pursue. Your Money or Your Life is a solid ethical guideline to incorporate into this manifestation of the future of search. So it’s not that I think there’s a new and novel ethical grounding that needs to happen. I think it’s more ensuring that the ethical grounding that exists remains foundational to the new AI search tools.
For years, Nvidia has dominated many machine learning benchmarks, and now there are two more notches in its belt.
MLPerf, the AI benchmarking suite sometimes called “the Olympics of machine learning,” has released a new set of training tests to help make more and better apples-to-apples comparisons between competing computer systems. One of MLPerf’s new tests concerns fine-tuning of large language models, a process that takes an existing trained model and trains it a bit more with specialized knowledge to make it fit for a particular purpose. The other is for graph neural networks, a type of machine learning behind some literature databases, fraud detection in financial systems, and social networks.
Even with the additions and the participation of computers using Google’s and Intel’s AI accelerators, systems powered by Nvidia’s Hopper architecture dominated the results once again. One system that included 11,616 Nvidia H100 GPUs—the largest collection yet—topped each of the nine benchmarks, setting records in five of them (including the two new benchmarks).
The 11,616-H100 system is “the biggest we’ve ever done,” says Dave Salvator, director of accelerated computing products at Nvidia. It smashed through the GPT-3 training trial in less than 3.5 minutes. A 512-GPU system, for comparison, took about 51 minutes. (Note that the GPT-3 task is not a full training, which could take weeks and cost millions of dollars. Instead, the computers train on a representative portion of the data, at an agreed-upon point well before completion.)
Compared to Nvidia’s largest entrant on GPT-3 last year, a 3,584-H100 computer, the 3.5-minute result represents a 3.2-fold improvement. You might expect that just from the difference in the size of these systems, but in AI computing that isn’t always the case, explains Salvator. “If you just throw hardware at the problem, it’s not a given that you’re going to improve,” he says.
“We are getting essentially linear scaling,” says Salvator. By that he means that twice as many GPUs lead to a halved training time. “[That] represents a great achievement from our engineering teams,” he adds.
Competitors are also getting closer to linear scaling. This round Intel deployed a system using 1,024 GPUs that performed the GPT-3 task in 67 minutes versus a computer one-fourth the size that took 224 minutes six months ago. Google’s largest GPT-3 entry used 12 times the number of TPU v5p accelerators as its smallest entry and performed its task nine times as fast.
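To make the comparison concrete, the figures quoted above can be turned into a rough scaling efficiency: the measured speedup divided by the increase in chip count, where 1.0 would be perfectly linear. The sketch below is a back-of-the-envelope calculation, not an MLPerf metric, and the function name is a hypothetical one chosen for illustration. The Nvidia and Intel comparisons span different submission rounds, so software improvements are mixed into those speedups.

```python
def scaling_efficiency(chip_ratio: float, speedup: float) -> float:
    """Return the fraction of ideal linear scaling achieved.

    chip_ratio: how many times more accelerators the larger system has.
    speedup:    how many times faster the larger system finished.
    """
    return speedup / chip_ratio


# Nvidia: 11,616 H100s versus last year's 3,584, reported 3.2-fold improvement.
nvidia = scaling_efficiency(11616 / 3584, 3.2)   # about 0.99

# Intel: four times the chips, 224 minutes down to 67 minutes.
intel = scaling_efficiency(4.0, 224 / 67)        # about 0.84

# Google: 12 times the accelerators, nine times as fast.
google = scaling_efficiency(12.0, 9.0)           # 0.75
```

On these numbers, Nvidia's result sits closest to the linear ideal, with Intel and Google somewhat further away.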
Linear scaling is going to be particularly important for upcoming “AI factories” housing 100,000 GPUs or more, Salvator says. He says to expect one such data center to come online this year, and another, using Nvidia’s next architecture, Blackwell, to start up in 2025.
Nvidia’s streak continues
Nvidia continued to cut training times despite using the same architecture, Hopper, as it did in last year’s training results. That’s all down to software improvements, says Salvator. “Typically, we’ll get a 2-2.5x [boost] from software after a new architecture is released,” he says.
For GPT-3 training, Nvidia logged a 27 percent improvement from the June 2023 MLPerf benchmarks. Salvator says there were several software changes behind the boost. For example, Nvidia engineers tuned up Hopper’s use of less accurate, 8-bit floating point operations by trimming unnecessary conversions between 8-bit and 16-bit numbers and better targeting of which layers of a neural network could use the lower precision number format. They also found a more intelligent way to adjust the power budget of each chip’s compute engines, and sped communication among GPUs in a way that Salvator likened to “buttering your toast while it’s still in the toaster.”
Additionally, the company implemented a scheme called flash attention. Invented in the Stanford University laboratory of SambaNova founder Chris Ré, flash attention is an algorithm that speeds transformer networks by minimizing writes to memory. When it first showed up in MLPerf benchmarks, flash attention shaved as much as 10 percent from training times. (Intel, too, used a version of flash attention but not for GPT-3. It instead used the algorithm for one of the new benchmarks, fine-tuning.)
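The memory-saving idea can be illustrated with a toy example. The sketch below is not Nvidia's or Stanford's implementation, which is a GPU kernel; it is a plain-Python illustration, for a single query vector, of the block-by-block "online softmax" trick that lets attention be computed without ever storing the full list of scores. The function names and block size are assumptions made for illustration, and the usual scaling of scores by the square root of the vector length is omitted for brevity.

```python
import math


def attention_reference(q, keys, values):
    """Standard attention for one query: materializes every score at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def attention_tiled(q, keys, values, block=2):
    """Same result, but keys/values are visited one block at a time.

    Only a running maximum, a running sum, and a running weighted
    accumulator are kept, so the full score list is never stored.
    """
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_block = keys[start:start + block]
        v_block = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (exp(-inf) is 0.0, so the first block starts from zero).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Made-up toy data to compare the two versions.
q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0], [0.0, 3.0]]
reference = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, block=2)
```

Both functions should agree to within floating-point error. On real hardware, the payoff comes from keeping each block in fast on-chip memory rather than writing intermediate scores out to slower memory.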
Using other software and network tricks, Nvidia delivered an 80 percent speedup in the text-to-image test, Stable Diffusion, versus its submission in November 2023.
New benchmarks
MLPerf adds new benchmarks and upgrades old ones to stay relevant to what’s happening in the AI industry. This year saw the addition of fine-tuning and graph neural networks.
Fine-tuning takes an already trained LLM and specializes it for use in a particular field. Nvidia, for example, took a trained 43-billion-parameter model and trained it on the GPU-maker’s design files and documentation to create ChipNeMo, an AI intended to boost the productivity of its chip designers. At the time, the company’s chief technology officer Bill Dally said that training an LLM was like giving it a liberal arts education, and fine-tuning was like sending it to graduate school.
The MLPerf benchmark takes a pretrained Llama-2-70B model and asks the system to fine-tune it using a dataset of government documents with the goal of generating more accurate document summaries.
There are several ways to do fine-tuning. MLPerf chose one called low-rank adaptation (LoRA). The method winds up training only a small portion of the LLM’s parameters, leading to a 3-fold lower burden on hardware and reduced use of memory and storage versus other methods, according to the organization.
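A small sketch shows why LoRA trains so few parameters. In the usual formulation, a frozen weight matrix W is left untouched and a product of two thin matrices, B and A, is added to its output; only A and B are trained. The code below is a plain-Python illustration under assumed values: the 8,192-wide layer and the rank of 16 are illustrative choices, not the benchmark's configuration, and the function names are hypothetical.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int):
    """Return (frozen weights in W, trainable weights in A and B)."""
    full = d_in * d_out            # W is d_out x d_in and stays frozen
    lora = rank * (d_in + d_out)   # A is rank x d_in, B is d_out x rank
    return full, lora


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute W x + scale * B (A x): the frozen path plus the low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


# Illustrative sizes only: one 8,192 x 8,192 layer adapted at rank 16.
full, lora = lora_param_counts(8192, 8192, 16)
fraction = lora / full   # 1/256, or roughly 0.4 percent of the layer

# Tiny numeric example with rank 1.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 1.0]]
x = [1.0, 1.0]
untrained = lora_forward(W, A, [[0.0], [0.0]], x)   # B starts at zero: [3.0, 7.0]
adapted = lora_forward(W, A, [[1.0], [2.0]], x)     # [5.0, 11.0]
```

Because B is typically initialized to zero, the adapted model starts out identical to the pretrained one, and training only has to move the small A and B matrices.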
The other new benchmark involved a graph neural network (GNN). These are for problems that can be represented by a very large set of interconnected nodes, such as a social network or a recommender system. Compared to other AI tasks, GNNs require a lot of communication between nodes in a computer.
The benchmark trained a GNN on a database that shows relationships among academic authors, papers, and institutes, forming a graph with 547 million nodes and 5.8 billion edges. The neural network was then trained to predict the right label for each node in the graph.
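The core operation in most GNNs is message passing, in which each node updates its features by combining them with those of its neighbors. The sketch below shows one round of the simplest version, mean aggregation, on a made-up three-node graph. It is an illustration of the general idea only, not the model used in the benchmark, and the node names and feature values are invented.

```python
def mean_aggregate(features, edges):
    """One round of message passing on an undirected graph.

    Each node's new feature vector is the mean of its own vector and
    the vectors of its direct neighbors.
    """
    neighbors = {node: [node] for node in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    updated = {}
    for node, group in neighbors.items():
        dim = len(features[node])
        updated[node] = [
            sum(features[m][i] for m in group) / len(group)
            for i in range(dim)
        ]
    return updated


# Toy graph: an author linked to a paper and to an institute.
features = {
    "author": [1.0, 0.0],
    "paper": [0.0, 1.0],
    "institute": [0.0, 0.0],
}
edges = [("author", "paper"), ("author", "institute")]
updated = mean_aggregate(features, edges)
```

In a full GNN, each round would also apply learned weights, and a final layer would map every node's features to a predicted label. At the benchmark's scale, neighbors often sit on different machines, which is one reason the workload is so communication-heavy.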
Future fights
Training rounds in 2025 may see head-to-head contests comparing new accelerators from AMD, Intel, and Nvidia. AMD’s MI300 series was launched about six months ago, and a memory-boosted upgrade, the MI325x, is planned for the end of 2024, with the next-generation MI350 slated for 2025. Intel says its Gaudi 3, generally available to computer makers later this year, will appear in MLPerf’s upcoming inferencing benchmarks. Intel executives have said the new chip has the capacity to beat H100 at training LLMs. But the victory may be short-lived, as Nvidia has unveiled a new architecture, Blackwell, which is planned for late this year.
On Thursday, Anthropic announced Claude 3.5 Sonnet, its latest AI language model and the first in a new series of "3.5" models that build upon Claude 3, launched in March. Claude 3.5 can compose text, analyze data, and write code. It features a 200,000 token context window and is available now on the Claude website and through an API. Anthropic also introduced Artifacts, a new feature in the Claude interface that shows related work documents in a dedicated window.
So far, people outside of Anthropic seem impressed. "This model is really, really good," wrote independent AI researcher Simon Willison on X. "I think this is the new best overall model (and both faster and half the price of Opus, similar to the GPT-4 Turbo to GPT-4o jump)."
As we've written before, benchmarks for large language models (LLMs) are troublesome because they can be cherry-picked and often do not capture the feel and nuance of using a machine to generate outputs on almost any conceivable topic. But according to Anthropic, Claude 3.5 Sonnet matches or outperforms competitor models like GPT-4o and Gemini 1.5 Pro on certain benchmarks like MMLU (undergraduate level knowledge), GSM8K (grade school math), and HumanEval (coding).
On Monday, Apple debuted "Apple Intelligence," a new suite of free AI-powered features for iOS 18, iPadOS 18, and macOS Sequoia that includes creating email summaries, generating images and emoji, and allowing Siri to take actions on your behalf. These features are achieved through a combination of on-device and cloud processing, with a strong emphasis on privacy. Apple says that Apple Intelligence features will be widely available later this year and will be available as a beta test for developers this summer.
The announcements came during a livestream WWDC keynote and a simultaneous event attended by the press on Apple's campus in Cupertino, California. In an introduction, Apple CEO Tim Cook said the company has been using machine learning for years, but the introduction of large language models (LLMs) presents new opportunities to elevate the capabilities of Apple products. He emphasized the need for both personalization and privacy in Apple's approach.
At last year's WWDC, Apple avoided using the term "AI" completely, instead preferring terms like "machine learning" as Apple's way of avoiding buzzy hype while integrating applications of AI into apps in useful ways. This year, Apple figured out a new way to largely avoid the abbreviation "AI" by coining "Apple Intelligence," a catchall branding term that refers to a broad group of machine learning, LLM, and image generation technologies. By our count, the term "AI" was used sparingly in the keynote—most notably near the end of the presentation when Apple executive Craig Federighi said, "It's AI for the rest of us."
But such conveniences barely hint at the massive, sweeping changes to employment predicted by some analysts. And already, in ways large and small, striking and subtle, the tech world’s notables are grappling with changes, both real and envisioned, wrought by the onset of generative AI. To get a better idea of how some of them view the future of generative AI, IEEE Spectrum asked three luminaries—an academic leader, a regulator, and a semiconductor industry executive—about how generative AI has begun affecting their work. The three, Andrea Goldsmith, Juraj Čorba, and Samuel Naffziger, agreed to speak with Spectrum at the 2024 IEEE VIC Summit & Honors Ceremony Gala, held in May in Boston.
Juraj Čorba, senior expert on digital regulation and governance, Slovak Ministry of Investments, Regional Development
Andrea Goldsmith
Andrea Goldsmith is dean of engineering at Princeton University.
There must be tremendous pressure now to throw a lot of resources into large language models. How do you deal with that pressure? How do you navigate this transition to this new phase of AI?
Andrea Goldsmith: Universities generally are going to be very challenged, especially universities that don’t have the resources of a place like Princeton or MIT or Stanford or the other Ivy League schools. In order to do research on large language models, you need brilliant people, which all universities have. But you also need compute power and you need data. And the compute power is expensive, and the data generally sits in these large companies, not within universities.
So I think universities need to be more creative. We at Princeton have invested a lot of money in the computational resources for our researchers to be able to do—well, not large language models, because you can’t afford it. To do a large language model… look at OpenAI or Google or Meta. They’re spending hundreds of millions of dollars on compute power, if not more. Universities can’t do that.
But we can be more nimble and creative. What can we do with language models, maybe not large language models but with smaller language models, to advance the state of the art in different domains? Maybe it’s vertical domains of using, for example, large language models for better prognosis of disease, or for prediction of cellular channel changes, or in materials science to decide what’s the best path to pursue a particular new material that you want to innovate on. So universities need to figure out how to take the resources that we have to innovate using AI technology.
We also need to think about new models. And the government can also play a role here. The [U.S.] government has this new initiative, NAIRR, or National Artificial Intelligence Research Resource, where they’re going to put up compute power and data and experts for educators to use—researchers and educators.
That could be a game-changer because it’s not just each university investing their own resources or faculty having to write grants, which are never going to pay for the compute power they need. It’s the government pulling together resources and making them available to academic researchers. So it’s an exciting time, where we need to think differently about research—meaning universities need to think differently. Companies need to think differently about how to bring in academic researchers, how to open up their compute resources and their data for us to innovate on.
As a dean, you are in a unique position to see which technical areas are really hot, attracting a lot of funding and attention. But how much ability do you have to steer a department and its researchers into specific areas? Of course, I’m thinking about large language models and generative AI. Is deciding on a new area of emphasis or a new initiative a collaborative process?
Goldsmith: Absolutely. I think any academic leader who thinks that their role is to steer their faculty in a particular direction does not have the right perspective on leadership. I describe academic leadership as really about the success of the faculty and students that you’re leading. And when I did my strategic planning for Princeton Engineering in the fall of 2020, everything was shut down. It was the middle of COVID, but I’m an optimist. So I said, “Okay, this isn’t how I expected to start as dean of engineering at Princeton.” But the opportunity to lead engineering in a great liberal arts university that has aspirations to increase the impact of engineering hasn’t changed. So I met with every single faculty member in the School of Engineering, all 150 of them, one-on-one over Zoom.
And the question I asked was, “What do you aspire to? What should we collectively aspire to?” And I took those 150 responses, and I asked all the leaders and the departments and the centers and the institutes, because there already were some initiatives in robotics and bioengineering and in smart cities. And I said, “I want all of you to come up with your own strategic plans. What do you aspire to in these areas? And then let’s get together and create a strategic plan for the School of Engineering.” So that’s what we did. And everything that we’ve accomplished in the last four years that I’ve been dean came out of those discussions, and what it was the faculty and the faculty leaders in the school aspired to.
So we launched a bioengineering institute last summer. We just launched Princeton Robotics. We’ve launched some things that weren’t in the strategic plan that bubbled up. We launched a center on blockchain technology and its societal implications. We have a quantum initiative. We have an AI initiative using this powerful tool of AI for engineering innovation, not just around large language models, but it’s a tool—how do we use it to advance innovation and engineering? All of these things came from the faculty because, to be a successful academic leader, you have to realize that everything comes from the faculty and the students. You have to harness their enthusiasm, their aspirations, their vision to create a collective vision.
What are the most important organizations and governing bodies when it comes to policy and governance on artificial intelligence in Europe?
Juraj Čorba: Well, there are many. And it also creates a bit of a confusion around the globe—who are the actors in Europe? So it’s always good to clarify. First of all we have the European Union, which is a supranational organization composed of many member states, including my own Slovakia. And it was the European Union that proposed adoption of a horizontal legislation for AI in 2021. It was the initiative of the European Commission, the E.U. institution, which has a legislative initiative in the E.U. And the E.U. AI Act is now finally being adopted. It was already adopted by the European Parliament.
So this started, you said 2021. That’s before ChatGPT and the whole large language model phenomenon really took hold.
Čorba: That was the case. Well, the expert community already knew that something was being cooked in the labs. But, yes, the whole agenda of large models, including large language models, came up only later on, after 2021. So the European Union tried to reflect that. Basically, the initial proposal to regulate AI was based on a blueprint of so-called product safety, which somehow presupposes a certain intended purpose. In other words, the checks and assessments of products are based more or less on the logic of the mass production of the 20th century, on an industrial scale, right? Like when you have products that you can somehow define easily and all of them have a clearly intended purpose. Whereas with these large models, a new paradigm was arguably opened, where they have a general purpose.
So the whole proposal was then rewritten in negotiations between the Council of Ministers, which is one of the legislative bodies, and the European Parliament. And so what we have today is a combination of this old product-safety approach and some novel aspects of regulation specifically designed for what we call general-purpose artificial intelligence systems or models. So that’s the E.U.
By product safety, you mean, if AI-based software is controlling a machine, you need to have physical safety.
Čorba: Exactly. That’s one of the aspects. So that touches upon the tangible products such as vehicles, toys, medical devices, robotic arms, et cetera. So yes. But from the very beginning, the proposal contained a regulation of what the European Commission called stand-alone systems—in other words, software systems that do not necessarily command physical objects. So it was already there from the very beginning, but all of it was based on the assumption that all software has its easily identifiable intended purpose—which is not the case for general-purpose AI.
Also, large language models and generative AI in general bring in this whole other dimension, of propaganda, false information, deepfakes, and so on, which is different from traditional notions of safety in real-time software.
Čorba: Well, this is exactly the aspect that is handled by another European organization, different from the E.U., and that is the Council of Europe. It’s an international organization established after the Second World War for the protection of human rights, for protection of the rule of law, and protection of democracy. So that’s where the Europeans, but also many other states and countries, started to negotiate a first international treaty on AI. For example, the United States have participated in the negotiations, and also Canada, Japan, Australia, and many other countries. And then these particular aspects, which are related to the protection of integrity of elections, rule-of-law principles, protection of fundamental rights or human rights under international law—all these aspects have been dealt with in the context of these negotiations on the first international treaty, which is to be now adopted by the Committee of Ministers of the Council of Europe on the 16th and 17th of May. So, pretty soon. And then the first international treaty on AI will be submitted for ratifications.
So prompted largely by the activity in large language models, AI regulation and governance now is a hot topic in the United States, in Europe, and in Asia. But of the three regions, I get the sense that Europe is proceeding most aggressively on this topic of regulating and governing artificial intelligence. Do you agree that Europe is taking a more proactive stance in general than the United States and Asia?
Čorba: I’m not so sure. If you look at the Chinese approach and the way they regulate what we call generative AI, it would appear to me that they also take it very seriously. They take a different approach from the regulatory point of view. But it seems to me that, for instance, China is taking a very focused and careful approach. For the United States, I wouldn’t say that the United States is not taking a careful approach because last year you saw many of the executive orders, or even this year, some of the executive orders issued by President Biden. Of course, this was not a legislative measure, this was a presidential order. But it seems to me that the United States is also trying to address the issue very actively. The United States has also initiated the first resolution of the General Assembly at the U.N. on AI, which was passed just recently. So I wouldn’t say that the E.U. is more aggressive in comparison with Asia or North America, but maybe I would say that the E.U. is the most comprehensive. It looks horizontally across different agendas and it uses binding legislation as a tool, which is not always the case around the world. Many countries simply feel that it’s too early to legislate in a binding way, so they opt for soft measures or guidance, collaboration with private companies, et cetera. Those are the differences that I see.
Do you perceive a difference in focus among the three regions? Are there certain aspects that are being more aggressively pursued in the United States than in Europe or vice versa?
Čorba: Certainly the E.U. is very focused on the protection of human rights, the full catalog of human rights, but also, of course, on safety and human health. These are the core goals or values to be protected under the E.U. legislation. As for the United States and for China, I would say that the primary focus in those countries—but this is only my personal impression—is on national and economic security.
Samuel Naffziger
Samuel Naffziger is senior vice president and a corporate fellow at Advanced Micro Devices, where he is responsible for technology strategy and product architectures. Naffziger was instrumental in AMD’s embrace and development of chiplets, which are semiconductor dies that are packaged together into high-performance modules.
To what extent is large language model training starting to influence what you and your colleagues do at AMD?
Samuel Naffziger: Well, there are a couple levels of that. LLMs are impacting the way a lot of us live and work. And we certainly are deploying that very broadly internally for productivity enhancements, for using LLMs to provide starting points for code—simple verbal requests, such as “Give me a Python script to parse this dataset.” And you get a really nice starting point for that code. Saves a ton of time. Writing verification test benches, helping with the physical design layout optimizations. So there’s a lot of productivity aspects.
The other aspect to LLMs is, of course, we are actively involved in designing GPUs [graphics processing units] for LLM training and for LLM inference. And so that’s driving a tremendous amount of workload analysis on the requirements, hardware requirements, and hardware-software codesign, to explore.
So that brings us to your current flagship, the Instinct MI300X, which is actually billed as an AI accelerator. How did the particular demands influence that design? I don’t know when that design started, but the ChatGPT era started about two years ago or so. To what extent did you read the writing on the wall?
Naffziger: So we were just into the MI300—in 2019, we were starting the development. A long time ago. And at that time, our revenue stream from the Zen [an AMD architecture used in a family of processors] renaissance had really just started coming in. So the company was starting to get healthier, but we didn’t have a lot of extra revenue to spend on R&D at the time. So we had to be very prudent with our resources. And we had strategic engagements with the [U.S.] Department of Energy for supercomputer deployments. That was the genesis for our MI line—we were developing it for the supercomputing market. Now, there was a recognition that munching through FP64 COBOL code, or Fortran, isn’t the future, right? [laughs] This machine-learning [ML] thing is really getting some legs.
So we put some of the lower-precision math formats in, like Brain Floating Point 16 at the time, that were going to be important for inference. And the DOE knew that machine learning was going to be an important dimension of supercomputers, not just legacy code. So that’s the way, but we were focused on HPC [high-performance computing]. We had the foresight to understand that ML had real potential. Although certainly no one predicted, I think, the explosion we’ve seen today.
So that’s how it came about. And, just another piece of it: We leveraged our modular chiplet expertise to architect the 300 to support a number of variants from the same silicon components. So the variant targeted to the supercomputer market had CPUs integrated in as chiplets, directly on the silicon module. And then it had six of the GPU chiplets we call XCDs around them. So we had three CPU chiplets and six GPU chiplets. And that provided an amazingly efficient, highly integrated, CPU-plus-GPU design we call MI300A. It’s very compelling for the El Capitan supercomputer that’s being brought up as we speak.
But we also recognize that for the maximum computation for these AI workloads, the CPUs weren’t that beneficial. We wanted more GPUs. For these workloads, it’s all about the math and matrix multiplies. So we were able to just swap out those three CPU chiplets for a couple more XCD GPUs. And so we got eight XCDs in the module, and that’s what we call the MI300X. So we kind of got lucky having the right product at the right time, but there was also a lot of skill involved in that we saw the writing on the wall for where these workloads were going and we provisioned the design to support it.
Earlier you mentioned 3D chiplets. What do you feel is the next natural step in that evolution?
Naffziger: AI has created this bottomless thirst for more compute [power]. And so we are always going to be wanting to cram as many transistors as possible into a module. And the reason that’s beneficial is, these systems deliver AI performance at scale with thousands, tens of thousands, or more, compute devices. They all have to be tightly connected together, with very high bandwidths, and all of that bandwidth requires power, requires very expensive infrastructure. So if a certain level of performance is required—a certain number of petaflops, or exaflops—the strongest lever on the cost and the power consumption is the number of GPUs required to achieve a zettaflop, for instance. And if the GPU is a lot more capable, then all of that system infrastructure collapses down—if you only need half as many GPUs, everything else goes down by half. So there’s a strong economic motivation to achieve very high levels of integration and performance at the device level. And the only way to do that is with chiplets and with 3D stacking. So we’ve already embarked down that path. A lot of tough engineering problems to solve to get there, but that’s going to continue.
And so what’s going to happen? Well, obviously we can add layers, right? We can pack more in. The thermal challenges that come along with that are going to be fun engineering problems that our industry is good at solving.
Large Language Models (LLMs) have taken the world by storm since the 2017 Transformers paper, but pushing them to the edge has proved problematic. Just this year, Google had to revise its plans to roll out Gemini Nano on all new Pixel models — the down-spec’d hardware options proved unable to host the model as part of a positive user experience. But the implementation of language-focused models at the edge is perhaps the wrong metric to look at. If you are forced to host a language-focused model for your phone or car in the cloud, that may be acceptable as an intermediate step in development. Vision applications of AI, on the other hand, are not so flexible: many of them rely on low latency and high dependability. If a vehicle relies on AI to identify that it should not hit the obstacle in front of it, a blip in contacting the server can be fatal. Accordingly, the most important LLMs to fit on the edge are vision models — the models whose purpose is most undermined by the reliance on remote resources.
“Large Language Models” can be an imprecise term, so it is worth defining. The original 2017 Transformer LLM that many see as kickstarting the AI rush had 215 million parameters. BERT was giant for its time (2018) at 335 million parameters. Both of these models might be relabeled as “Small Language Models” by some today to distinguish them from models like GPT-4 and Gemini Ultra, with as many as 1.7 trillion parameters, but for the purposes here, all fall under the LLM category. All of these are language models, though, so why does it matter for vision? The trick here is that language is an abstract system of deriving meaning from a structured ordering of arbitrary objects. There is no “correct” association of meaning and form in language on which we could base these models. Accordingly, these arbitrary units are substitutable: nothing forces architecture developed for language to be applied only to language, and all the language objects are converted to multidimensional vectors anyway. LLM architecture is thus highly generalizable, and typically retains the core strength from having been developed for language: a strong ability to carry through semantic information. Thus, when we talk about LLMs at the edge, we may mean a language model cross-trained on image data, or a vision-only model built on the foundation of technology designed for language. At the software and hardware levels, for bringing models to the edge, this distinction makes little difference.
Vision LLMs on the edge apply flexibly across many use cases, but the key applications where they show the greatest advantages are: embodied agents (an especially striking example of the benefits of cross-training embodied agents on language data is Dynalang’s advantage over DreamerV3 in interpreting the world, due to superior semantic parsing), inpainting (as seen with latent diffusion models), decision-making in self-driving vehicles (as with LINGO-2), context-aware security (such as ViViT), information extraction (Gemini’s ability to find and report data from video), and user assistance (physician aids, driver assist, etc.). Especially notable and exciting here is the ability of vision LLMs to use language as a lossy storage and abstraction of visual data that decision-making algorithms can then interact with, as seen in LINGO-2 and Dynalang. Many of these vision-oriented LLMs depend on edge deployment to realize their value, and they benefit from the work that has already been done to optimize language-oriented LLMs. Despite this, vision LLMs still struggle with edge deployment, just as the language-oriented models do. The improvements for edge deployments come in three classes: model architecture, system resource utilization, and hardware optimization. We will briefly review the first two and look more closely at the third, since it often gets the least attention.
Model architecture optimizations include the optimizations that must be made at the model level: “distilling” models to create leaner imitators, restructuring where models spend their resource budget (such as the redistribution of transformer modules in Stable Diffusion XL) and pursuing alternate architectures (state-space models, H3 modules, etc.) to escape the quadratically scaling costs of transformers.
System resource optimizations are all the things that can be done in software to an already complete model. Quantization (to INT8, INT4, or even INT2) is a common focus here for both latency and memory burden, but of course compromises accuracy. Speculative decoding can improve utilization and latency. And of course, tiling, such as seen with FlashAttention, has become near-ubiquitous for improving utilization and latency.
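To make the quantization trade-off concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is only an illustration under simple assumptions (one scale for the whole tensor, round-to-nearest); the function names are hypothetical, and production toolchains typically use per-channel scales and calibration data.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of a list of floats to INT8 codes."""
    max_abs = max(abs(v) for v in values)
    # One shared scale maps the largest magnitude onto the INT8 limit 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map INT8 codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.02, -1.27, 0.635, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
# Rounding to the nearest code bounds the per-weight error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight now needs 8 bits instead of 16 or 32, at the cost of a rounding error of at most half a quantization step per weight. That error is the accuracy compromise mentioned above.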
Finally, there are hardware optimizations. The first option here is a general-purpose GPU, TPU, NPU, or similar, but those tend to be best suited for settings where capability is needed without demanding streamlined optimization, such as might be the case on a home computer. Custom hardware, such as purpose-built NPUs, generally has the advantage when the application is especially sensitive to latency or resource consumption, and this covers many of the applications for vision LLMs.
Exploring this trade-off further: Stable Diffusion’s architecture and resource demands have been discussed here before, but it is worth circling back to it as an example of why hardware solutions are so important in this space. Using Stable Diffusion 1.5 for simplicity, let us focus specifically on the U-Net component of the model. In this diagram, you can see the rough construction of the model: it downsamples repeatedly on the left until it hits the bottom of the U, and then upsamples up the right side, bringing back in residual connections from the left at each stage.
This U-Net implementation has 865 million parameters and entails 750 billion operations. The parameters are a fair proxy for the memory burden, and the operations are a direct representation of the compute demands. The distribution of these burdens on resources is not even, however. If we plot the parameters and operations for each layer, a clear picture emerges:
These graphs show a model that is destined for gross inefficiencies at every step. Most of the memory burden peaks in the center, whereas the compute is heavily taxed at the two tails but underutilized in the center. These inefficiencies come with costs. The memory peak can overwhelm on-chip storage, thus incurring I/O operations, or else requiring a large excess of unused memory for most of the graph. Similarly, storing residuals for later incurs I/O latency and higher power draws. The underutilization of the compute power at the center of the graph means that the processor will have wasteful power draw as it cannot use the tail of the power curve as it does sparser operations. While software interventions can also help here, this is exactly the kind of problem that custom hardware solutions are meant to address. Custom silicon tailored to the model can let you offload some of that memory burden into additional compute cycles at the center of the graph without incurring extra I/O operations by recomputing the residual connections instead of kicking them out to memory. In doing so, the total required memory drops, and the processor can remain at full utilization. Rightsizing the resource allotment and finding ways to redistribute the burdens are key components to how these models can be best deployed at the edge.
Despite their name, LLMs are important to the vision domain for their flexibility in handling different inputs and their strength at interpreting meaning in images. Whether used for embodied agents, context-aware security, or user assistance, their use at the edge requires dependably low latency, which precludes cloud-based solutions, in contrast to other AI applications on edge devices. Bringing them successfully to the edge asks for optimizations at every level, and we have already seen some of the possibilities at the hardware level. Conveniently, the common architecture with language-oriented LLMs means that many of the solutions needed to bring these most essential models to the edge may in turn generalize back to the language-oriented models which donated the architecture in the first place.
Large language models, the AI systems that power chatbots like ChatGPT, are getting better and better, but they’re also getting bigger and bigger, demanding more energy and computational power. For LLMs to be cheap, fast, and environmentally friendly, they’ll need to shrink, ideally becoming small enough to run directly on devices like cellphones. Researchers are finding ways to do just that by drastically rounding off the many high-precision numbers that store their memories to equal just 1 or -1.
LLMs, like all neural networks, are trained by altering the strengths of connections between their artificial neurons. These strengths are stored as mathematical parameters. Researchers have long compressed networks by reducing the precision of these parameters—a process called quantization—so that instead of taking up 16 bits each, they might take up 8 or 4. Now researchers are pushing the envelope to a single bit.
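The effect of precision on storage is simple arithmetic, sketched below for a hypothetical 13-billion-parameter model. The estimate counts only the weights themselves and ignores any per-group scale factors or other metadata a real quantization scheme would add.

```python
def model_size_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4, 1):
    print(f"{bits:>2} bits per parameter: {model_size_gb(13e9, bits)} GB")
```

Going from 16 bits to 1 bit shrinks the weights from 26 GB to 1.625 GB in this idealized count.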
How to Make a 1-bit LLM
There are two general approaches. One approach, called post-training quantization (PTQ), is to quantize the parameters of a full-precision network. The other approach, quantization-aware training (QAT), is to train a network from scratch to have low-precision parameters. So far, PTQ has been more popular with researchers.
In February, a team including Haotong Qin at ETH Zurich, Xianglong Liu at Beihang University, and Wei Huang at the University of Hong Kong introduced a PTQ method called BiLLM. It approximates most parameters in a network using 1 bit, but represents a few salient weights (those most influential to performance) using 2 bits. In one test, the team binarized a version of Meta’s LLaMA LLM that has 13 billion parameters.
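For intuition, the most basic form of post-training binarization replaces each weight with its sign and keeps one shared scale factor. The sketch below shows that generic baseline only; it is not the BiLLM algorithm, which treats salient weights separately, and the function names are hypothetical.

```python
def binarize(weights):
    """Approximate each weight as alpha * sign(weight).

    Choosing alpha as the mean absolute value minimizes the squared
    error of this approximation for a fixed sign pattern.
    """
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, alpha


def reconstruct(signs, alpha):
    """Recover the approximate full-precision weights."""
    return [alpha * s for s in signs]


signs, alpha = binarize([0.5, -1.5, 1.0, -1.0])
print(signs, alpha, reconstruct(signs, alpha))
```

Only the signs (1 bit each) and a single scale need to be stored, which is where the memory savings come from. The reconstruction error on outlying weights such as -1.5 hints at why some methods give influential weights extra bits.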
To score performance, the researchers used a metric called perplexity, which is basically a measure of how surprised the trained model was by each ensuing piece of text. For one dataset, the original model had a perplexity of around 5, and the BiLLM version scored around 15, much better than the closest binarization competitor, which scored around 37 (for perplexity, lower numbers are better). That said, the BiLLM model required only about a tenth of the memory capacity of the original.
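Perplexity has a compact definition: it is the exponential of the average negative log-probability the model assigned to each token that actually occurred. A minimal sketch, assuming the per-token probabilities are already available as a list (the input values here are made up for illustration):

```python
import math


def perplexity(token_probs):
    """Exponential of the mean negative log-likelihood per token."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that gives every observed token probability 1/4 is as
# "surprised" as if it were choosing uniformly among four options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

On this reading, a perplexity of about 5 means the model is, on average, about as uncertain as a uniform choice among five candidates for the next token.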
PTQ has several advantages over QAT, says Wanxiang Che, a computer scientist at Harbin Institute of Technology, in China. It doesn’t require collecting training data, it doesn’t require training a model from scratch, and the training process is more stable. QAT, on the other hand, has the potential to make models more accurate, since quantization is built into the model from the beginning.
1-bit LLMs Find Success Against Their Larger Cousins
Last year, a team led by Furu Wei and Shuming Ma, at Microsoft Research Asia, in Beijing, created BitNet, the first 1-bit QAT method for LLMs. After fiddling with the rate at which the network adjusts its parameters, in order to stabilize training, they created LLMs that performed better than those created using PTQ methods. They were still not as good as full-precision networks, but roughly 10 times as energy efficient.
In February, Wei’s team announced BitNet 1.58b, in which parameters can equal -1, 0, or 1, which means they take up roughly 1.58 bits of memory per parameter. A BitNet model with 3 billion parameters performed just as well on various language tasks as a full-precision LLaMA model with the same number of parameters and amount of training, but it was 2.71 times as fast, used 72 percent less GPU memory, and used 94 percent less GPU energy. Wei called this an “aha moment.” Further, the researchers found that as they trained larger models, efficiency advantages improved.
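The figure of 1.58 bits is the information content of a three-valued symbol, log2(3). The sketch below checks that number and shows one hypothetical way to approach it in storage: since 3^5 = 243 fits in a byte, five ternary weights can share 8 bits, or 1.6 bits each. The packing scheme is an illustration, not a description of how BitNet stores weights.

```python
import math

bits_per_ternary = math.log2(3)  # about 1.585


def pack5(trits):
    """Pack five values from {-1, 0, 1} into one integer in 0..242."""
    if len(trits) != 5:
        raise ValueError("expected exactly five ternary values")
    code = 0
    for t in trits:
        code = code * 3 + (t + 1)  # shift {-1, 0, 1} to base-3 digits {0, 1, 2}
    return code


def unpack5(code):
    """Invert pack5, returning the five ternary values in order."""
    trits = []
    for _ in range(5):
        code, digit = divmod(code, 3)
        trits.append(digit - 1)
    return trits[::-1]


print(round(bits_per_ternary, 3), pack5([-1, 0, 1, 1, -1]))
```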
This year, a team led by Che, of Harbin Institute of Technology, released a preprint on another LLM binarization method, called OneBit. OneBit combines elements of both PTQ and QAT. It uses a full-precision pretrained LLM to generate data for training a quantized version. The team’s 13-billion-parameter model achieved a perplexity score of around 9 on one dataset, versus 5 for a LLaMA model with 13 billion parameters. Meanwhile, OneBit occupied only 10 percent as much memory. On customized chips, it could presumably run much faster.
Wei, of Microsoft, says quantized models have multiple advantages. They can fit on smaller chips, they require less data transfer between memory and processors, and they allow for faster processing. Current hardware can’t take full advantage of these models, though. LLMs often run on GPUs like those made by Nvidia, which represent weights using higher precision and spend most of their energy multiplying them. New hardware could natively represent each parameter as a -1 or 1 (or 0), and then simply add and subtract values and avoid multiplication. “One-bit LLMs open new doors for designing custom hardware and systems specifically optimized for 1-bit LLMs,” Wei says.
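The claim that such hardware could avoid multiplication is easy to see in code. When every weight is -1, 0, or 1, a matrix-vector product reduces to adding, subtracting, or skipping each input. This is a plain-Python sketch of the arithmetic only; it says nothing about how a real chip would be organized.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary weight matrix by a vector using only add and subtract."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi   # weight +1: add the input
            elif w == -1:
                acc -= xi   # weight -1: subtract the input
            # weight 0: the input is skipped entirely
        out.append(acc)
    return out


W = [[1, 0, -1],
     [-1, -1, 1]]
print(ternary_matvec(W, [2.0, 3.0, 5.0]))
```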
“They should grow up together,” Huang, of the University of Hong Kong, says of 1-bit models and processors. “But it’s a long way to develop new hardware.”
The mental-health app Woebot launched in 2017, back when “chatbot” wasn’t a familiar term and someone seeking a therapist could only imagine talking to a human being. Woebot was something exciting and new: a way for people to get on-demand mental-health support in the form of a responsive, empathic, AI-powered chatbot. Users found that the friendly robot avatar checked in on them every day, kept track of their progress, and was always available to talk something through.
Today, the situation is vastly different. Demand for mental-health services has surged while the supply of clinicians has stagnated. There are thousands of apps that offer automated support for mental health and wellness. And ChatGPT has helped millions of people experiment with conversational AI.
But even as the world has become fascinated with generative AI, people have also seen its downsides. As a company that relies on conversation, Woebot Health had to decide whether generative AI could make Woebot a better tool, or whether the technology was too dangerous to incorporate into our product.
Woebot is designed to have structured conversations through which it delivers evidence-based tools inspired by cognitive behavioral therapy (CBT), a technique that aims to change behaviors and feelings. Throughout its history, Woebot Health has used technology from a subdiscipline of AI known as natural-language processing (NLP). The company has used AI artfully and by design—Woebot uses NLP only in the service of better understanding a user’s written texts so it can respond in the most appropriate way, thus encouraging users to engage more deeply with the process.
Woebot, which is currently available in the United States, is not a generative-AI chatbot like ChatGPT. The differences are clear in both the bot’s content and structure. Everything Woebot says has been written by conversational designers trained in evidence-based approaches who collaborate with clinical experts; ChatGPT generates all sorts of unpredictable statements, some of which are untrue. Woebot relies on a rules-based engine that resembles a decision tree of possible conversational paths; ChatGPT uses statistics to determine what its next words should be, given what has come before.
The rules-based approach has served us well, protecting Woebot’s users from the types of chaotic conversations we observed from early generative chatbots. Prior to ChatGPT, open-ended conversations with generative chatbots were unsatisfying and easily derailed. One famous example is Microsoft’s Tay, a chatbot that was meant to appeal to millennials but turned lewd and racist in less than 24 hours.
But with the advent of ChatGPT in late 2022, we had to ask ourselves: Could the new large language models (LLMs) powering chatbots like ChatGPT help our company achieve its vision? Suddenly, hundreds of millions of users were having natural-sounding conversations with ChatGPT about anything and everything, including their emotions and mental health. Could this new breed of LLMs provide a viable generative-AI alternative to the rules-based approach Woebot has always used? The AI team at Woebot Health, including the authors of this article, was asked to find out.
Woebot, a mental-health chatbot, deploys concepts from cognitive behavioral therapy to help users. This demo shows how users interact with Woebot using a combination of multiple-choice responses and free-written text.
The Origin and Design of Woebot
Woebot got its start when the clinical research psychologist Alison Darcy, with support from the AI pioneer Andrew Ng, led the build of a prototype intended as an emotional support tool for young people. Darcy and another member of the founding team, Pierre Rappolt, took inspiration from video games as they looked for ways for the tool to deliver elements of CBT. Many of their prototypes contained interactive fiction elements, which then led Darcy to the chatbot paradigm. The first version of the chatbot was studied in a randomized controlled trial that offered mental-health support to college students. Based on the results, Darcy raised US $8 million from New Enterprise Associates and Andrew Ng’s AI Fund.
The Woebot app is intended to be an adjunct to human support, not a replacement for it. It was built according to a set of principles that we call Woebot’s core beliefs, which were shared on the day it launched. These tenets express a strong faith in humanity and in each person’s ability to change, choose, and grow. The app does not diagnose, it does not give medical advice, and it does not force its users into conversations. Instead, the app follows a Buddhist principle that’s prevalent in CBT of “sitting with open hands”—it extends invitations that the user can choose to accept, and it encourages process over results. Woebot facilitates a user’s growth by asking the right questions at optimal moments, and by engaging in a type of interactive self-help that can happen anywhere, anytime.
A Convenient Companion
Users interact with Woebot either by choosing prewritten responses or by typing in whatever text they’d like, which Woebot parses using AI techniques. Woebot deploys concepts from cognitive behavioral therapy to help users change their thought patterns. Here, it first asks a user to write down negative thoughts, then explains the cognitive distortions at work. Finally, Woebot invites the user to recast a negative statement in a positive way. (Not all exchanges are shown.)
These core beliefs strongly influenced both Woebot’s engineering architecture and its product-development process. Careful conversational design is crucial for ensuring that interactions conform to our principles. Test runs through a conversation are read aloud in “table reads,” and then revised to better express the core beliefs and flow more naturally. The user side of the conversation is a mix of multiple-choice responses and “free text,” or places where users can write whatever they wish.
Building an app that supports human health is a high-stakes endeavor, and we’ve taken extra care to adopt the best software-development practices. From the start, enabling content creators and clinicians to collaborate on product development required custom tools. An initial system using Google Sheets quickly became unscalable, and the engineering team replaced it with a proprietary Web-based “conversational management system” written in the JavaScript library React.
Within the system, members of the writing team can create content, play back that content in a preview mode, define routes between content modules, and find places for users to enter free text, which our AI system then parses. The result is a large rules-based tree of branching conversational routes, all organized within modules such as “social skills training” and “challenging thoughts.” These modules are translated from psychological mechanisms within CBT and other evidence-based techniques.
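A rules-based tree of this kind can be pictured as a table of prewritten messages, each listing the routes a user may take next. The sketch below is purely illustrative: the node names, messages, and structure are hypothetical and are not taken from Woebot's proprietary system.

```python
# Every line the bot can say is written in advance; routes connect the nodes.
CONVERSATION = {
    "check_in": {
        "say": "How are you feeling right now?",
        "options": {"Pretty good": "celebrate", "Not great": "challenging_thoughts"},
    },
    "celebrate": {
        "say": "Glad to hear it. Want to note what went well?",
        "options": {},
    },
    "challenging_thoughts": {
        "say": "Would you like to write down a thought that is bothering you?",
        "options": {"Yes": "write_thought", "Not now": "check_in"},
    },
    "write_thought": {
        "say": "Go ahead, I'm listening.",
        "options": {},
    },
}


def next_node(current, choice):
    """Follow a prewritten route; stay on the same node if the choice is not offered."""
    return CONVERSATION[current]["options"].get(choice, current)


node = next_node("check_in", "Not great")
print(node, "->", CONVERSATION[node]["say"])
```

Because the bot can only ever move between nodes that writers have defined, it cannot produce an unvetted statement, which is the safety property the rules-based design trades flexibility for.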
How Woebot Uses AI
While everything Woebot says is written by humans, NLP techniques are used to help understand the feelings and problems users are facing; then Woebot can offer the most appropriate modules from its deep bank of content. When users enter free text about their thoughts and feelings, we use NLP to parse these text inputs and route the user to the best response.
In Woebot’s early days, the engineering team used regular expressions, or “regexes,” to understand the intent behind these text inputs. Regexes are a text-processing method that relies on pattern matching within sequences of characters. Woebot’s regexes were quite complicated in some cases, and were used for everything from parsing simple yes/no responses to learning a user’s preferred nickname.
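As a rough idea of what regex-based intent parsing looks like, here is a small sketch using Python's standard re module. The patterns are hypothetical examples written for this illustration, not Woebot's actual expressions, and they show why such rules get complicated quickly: every new phrasing needs another alternative.

```python
import re

# Hypothetical patterns: each alternative must end at a word boundary.
YES = re.compile(r"^\s*(y|yes|yeah|yep|sure|ok|okay)\b", re.IGNORECASE)
NO = re.compile(r"^\s*(n|no|nope|nah)\b", re.IGNORECASE)
NICKNAME = re.compile(r"\b(?:call me|my name is)\s+([A-Za-z][A-Za-z'-]*)", re.IGNORECASE)


def parse_yes_no(text):
    """Return 'yes', 'no', or None when the reply matches neither pattern."""
    if YES.search(text):
        return "yes"
    if NO.search(text):
        return "no"
    return None


def extract_nickname(text):
    """Return the name following 'call me' or 'my name is', if any."""
    match = NICKNAME.search(text)
    return match.group(1) if match else None


print(parse_yes_no("Yeah, I think so"), extract_nickname("You can call me Sam!"))
```

A reply such as "not really" falls through both patterns and returns None, which illustrates the brittleness that motivates trained classifiers.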
Later in Woebot’s development, the AI team replaced regexes with classifiers trained with supervised learning. The process for creating AI classifiers that comply with regulatory standards was involved—each classifier required months of effort. Typically, a team of internal-data labelers and content creators reviewed examples of user messages (with all personally identifiable information stripped out) taken from a specific point in the conversation. Once the data was placed into categories and labeled, classifiers were trained that could take new input text and place it into one of the existing categories.
This process was repeated many times, with the classifier repeatedly evaluated against a test dataset until its performance satisfied us. As a final step, the conversational-management system was updated to “call” these AI classifiers (essentially activating them) and then to route the user to the most appropriate content. For example, if a user wrote that he was feeling angry because he got in a fight with his mom, the system would classify this response as a relationship problem.
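The "call, then route" step can be sketched as follows. A keyword-matching function stands in for the trained classifier, and the labels, module names, and confidence threshold are all hypothetical; the point is only the shape of the control flow.

```python
ROUTES = {"relationship": "relationship_module", "work": "work_stress_module"}
FALLBACK = "general_check_in"


def toy_classifier(text):
    """Stand-in for a trained classifier: returns (label, confidence)."""
    lowered = text.lower()
    if any(word in lowered for word in ("mom", "dad", "friend", "partner")):
        return "relationship", 0.9
    if any(word in lowered for word in ("boss", "job", "deadline")):
        return "work", 0.8
    return "other", 0.3


def route(text, classifier=toy_classifier, threshold=0.5):
    """Pick a content module, falling back when the classifier is unsure."""
    label, confidence = classifier(text)
    if confidence < threshold or label not in ROUTES:
        return FALLBACK
    return ROUTES[label]


print(route("I'm angry because I got in a fight with my mom"))
```

Passing the classifier in as a parameter mirrors the idea that the routing logic stays the same while the model behind it is swapped for a better one.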
The technology behind these classifiers is constantly evolving. In the early days, the team used an open-source library for text classification called fastText, sometimes in combination with regular expressions. As AI continued to advance and new models became available, the team was able to train new models on the same labeled data for improvements in both accuracy and recall. For example, when the early transformer model BERT was released in October 2018, the team rigorously evaluated its performance against the fastText version. BERT was superior in both precision and recall for our use cases, and so the team replaced all fastText classifiers with BERT and launched the new models in January 2019. We immediately saw improvements in classification accuracy across the models.
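Precision and recall, the two measures used in that comparison, can be computed directly from labeled test examples. The sketch below uses made-up labels for a single category; it shows the standard definitions and is not tied to any particular model or dataset.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one category, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of true positives, how many were found
    return precision, recall


y_true = ["rel", "rel", "work", "rel", "work"]
y_pred = ["rel", "work", "work", "rel", "rel"]
print(precision_recall(y_true, y_pred, positive="rel"))
```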
Woebot and Large Language Models
When ChatGPT was released in November 2022, Woebot was more than 5 years old. The AI team faced the question of whether LLMs like ChatGPT could be used to meet Woebot’s design goals and enhance users’ experiences, putting them on a path to better mental health.
We were excited by the possibilities, because ChatGPT could carry on fluid and complex conversations about millions of topics, far more than we could ever include in a decision tree. However, we had also heard about troubling examples of chatbots providing responses that were decidedly not supportive, including advice on how to maintain and hide an eating disorder and guidance on methods of self-harm. In one tragic case in Belgium, a grieving widow accused a chatbot of being responsible for her husband’s suicide.
The first thing we did was try out ChatGPT ourselves, and we quickly became experts in prompt engineering. For example, we prompted ChatGPT to be supportive and played the roles of different types of users to explore the system’s strengths and shortcomings. We described how we were feeling, explained some problems we were facing, and even explicitly asked for help with depression or anxiety.
A few things stood out. First, ChatGPT quickly told us we needed to talk to someone else—a therapist or doctor. ChatGPT isn’t intended for medical use, so this default response was a sensible design decision by the chatbot’s makers. But it wasn’t very satisfying to constantly have our conversation aborted. Second, ChatGPT’s responses were often bulleted lists of encyclopedia-style answers. For example, it would list six actions that could be helpful for depression. We found that these lists of items told the user what to do but didn’t explain how to take these steps. Third, in general, the conversations ended quickly and did not allow a user to engage in the psychological processes of change.
It was clear to our team that an off-the-shelf LLM would not deliver the psychological experiences we were after. LLMs are based on reward models that value the delivery of correct answers; they aren’t given incentives to guide a user through the process of discovering those results themselves. Instead of “sitting with open hands,” the models make assumptions about what the user is saying to deliver a response with the highest assigned reward.
We had to decide whether generative AI could make Woebot a better tool, or whether the technology was too dangerous to incorporate into our product.
To see if LLMs could be used within a mental-health context, we investigated ways of expanding our proprietary conversational-management system. We looked into frameworks and open-source techniques for managing prompts and prompt chains—sequences of prompts that ask an LLM to achieve a task through multiple subtasks. In January of 2023, a platform called LangChain was gaining in popularity and offered techniques for calling multiple LLMs and managing prompt chains. However, LangChain lacked some features that we knew we needed: It didn’t provide a visual user interface like our proprietary system, and it didn’t provide a way to safeguard the interactions with the LLM. We needed a way to protect Woebot users from the common pitfalls of LLMs, including hallucinations (where the LLM says things that are plausible but untrue) and simply straying off topic.
Ultimately, we decided to expand our platform by implementing our own LLM prompt-execution engine, which gave us the ability to inject LLMs into certain parts of our existing rules-based system. The engine allows us to support concepts such as prompt chains while also providing integration with our existing conversational routing system and rules. As we developed the engine, we were fortunate to be invited into the beta programs of many new LLMs. Today, our prompt-execution engine can call more than a dozen different LLM models, including variously sized OpenAI models, Microsoft Azure versions of OpenAI models, Anthropic’s Claude, Google Bard (now Gemini), and open-source models running on the Amazon Bedrock platform, such as Meta’s Llama 2. We use this engine exclusively for exploratory research that’s been approved by an institutional review board, or IRB.
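For readers unfamiliar with prompt chains, the idea can be sketched in a few lines. This is an illustrative toy, not Woebot's engine: `run_prompt_chain`, `fake_llm`, and the templates are all hypothetical names, and the stand-in model simply echoes its prompt so the flow between steps is visible.

```python
from typing import Callable, List

LLM = Callable[[str], str]


def run_prompt_chain(llm: LLM, templates: List[str], user_text: str) -> str:
    """Run the templates in order, feeding each step's output into the next."""
    previous = user_text
    for template in templates:
        prompt = template.format(user_text=user_text, previous=previous)
        previous = llm(prompt)
    return previous


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call (assumption); it only echoes the prompt.
    return f"<reply to: {prompt}>"


chain = [
    "Summarize what the user is feeling: {user_text}",
    "Write one open-ended question based on this summary: {previous}",
]
result = run_prompt_chain(fake_llm, chain, "I had a fight with my mom.")
print(result)
```

Because the model is passed in as a plain callable, the same chain could in principle be pointed at different providers, which is roughly the flexibility the engine described above is said to offer.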
It took us about three months to develop the infrastructure and tooling support for LLMs. Our platform allows us to package features into different products and experiments, which in turn lets us maintain control over software versions and manage our research efforts while ensuring that our commercially deployed products are unaffected. We’re not using LLMs in any of our products; the LLM-enabled features can be used only in a version of Woebot for exploratory studies.
A Trial for an LLM-Augmented Woebot
We had some false starts in our development process. We first tried creating an experimental chatbot that was almost entirely powered by generative AI; that is, the chatbot directly used the text responses from the LLM. But we ran into a couple of problems. The first issue was that the LLMs were eager to demonstrate how smart and helpful they are! This eagerness was not always a strength, as it interfered with the user’s own process.
For example, the user might be doing a thought-challenging exercise, a common tool in CBT. If the user says, “I’m a bad mom,” a good next step in the exercise could be to ask if the user’s thought is an example of “labeling,” a cognitive distortion where we assign a negative label to ourselves or others. But LLMs were quick to skip ahead and demonstrate how to reframe this thought, saying something like “A kinder way to put this would be, ‘I don’t always make the best choices, but I love my child.’” CBT exercises like thought challenging are most helpful when the person does the work themselves, coming to their own conclusions and gradually changing their patterns of thinking.
A second difficulty with LLMs was in style matching. While social media is rife with examples of LLMs responding in a Shakespearean sonnet or a poem in the style of Dr. Seuss, this format flexibility didn’t extend to Woebot’s style. Woebot has a warm tone that has been refined for years by conversational designers and clinical experts. But even with careful instructions and prompts that included examples of Woebot’s tone, LLMs produced responses that didn’t “sound like Woebot,” maybe because a touch of humor was missing, or because the language wasn’t simple and clear.
However, LLMs truly shone on an emotional level. When coaxing someone to talk about their joys or challenges, LLMs crafted personalized responses that made people feel understood. Without generative AI, it’s impossible to respond in a novel way to every different situation, and the conversation feels predictably “robotic.”
We ultimately built an experimental chatbot that possessed a hybrid of generative AI and traditional NLP-based capabilities. In July 2023 we registered an IRB-approved clinical study to explore the potential of this LLM-Woebot hybrid, looking at satisfaction as well as exploratory outcomes like symptom changes and attitudes toward AI. We feel it’s important to study LLMs within controlled clinical studies due to their scientific rigor and safety protocols, such as adverse event monitoring. Our Build study included U.S. adults above the age of 18 who were fluent in English and who had neither a recent suicide attempt nor current suicidal ideation. The double-blind structure assigned one group of participants the LLM-augmented Woebot while a control group got the standard version; we then assessed user satisfaction after two weeks.
We built technical safeguards into the experimental Woebot to ensure that it wouldn’t say anything to users that was distressing or counter to the process. The safeguards tackled the problem on multiple levels. First, we used what engineers consider “best in class” LLMs that are less likely to produce hallucinations or offensive language. Second, our architecture included different validation steps surrounding the LLM; for example, we ensured that Woebot wouldn’t give an LLM-generated response to an off-topic statement or a mention of suicidal ideation (in that case, Woebot provided the phone number for a hotline). Finally, we wrapped users’ statements in our own careful prompts to elicit appropriate responses from the LLM, which Woebot would then convey to users. These prompts included both direct instructions such as “don’t provide medical advice” as well as examples of appropriate responses in challenging situations.
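The layered structure described above (validate first, then wrap the user's statement in a prompt) can be sketched as follows. Everything here is a simplified assumption: the keyword patterns stand in for trained classifiers, the messages and the hotline placeholder are invented, and `guarded_reply` is a hypothetical name rather than part of the real system.

```python
import re
from typing import Callable

HOTLINE_MESSAGE = "It sounds like you need urgent support. Please call <hotline number>."
FALLBACK_MESSAGE = "Let's come back to what's on your mind today."

# Crude keyword checks used only for illustration (assumption).
CRISIS_PATTERN = re.compile(r"\b(suicid\w*|kill myself|end my life)\b", re.IGNORECASE)
OFF_TOPIC_PATTERN = re.compile(r"\b(stock tips|lottery numbers)\b", re.IGNORECASE)

# Hypothetical wrapper prompt carrying a direct instruction.
WRAPPER_PROMPT = (
    "You are a supportive companion. Don't provide medical advice. "
    "Reply briefly to this user statement:\n"
)


def guarded_reply(user_text: str, llm: Callable[[str], str]) -> str:
    """Return a scripted reply for risky or off-topic input; otherwise ask the LLM."""
    if CRISIS_PATTERN.search(user_text):
        return HOTLINE_MESSAGE  # never routed to the LLM
    if OFF_TOPIC_PATTERN.search(user_text):
        return FALLBACK_MESSAGE
    return llm(WRAPPER_PROMPT + user_text)
```

The key design point is ordering: the checks run before any model call, so a risky statement gets a fixed, reviewed response instead of generated text.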
While this initial study was short—two weeks isn’t much time when it comes to psychotherapy—the results were encouraging. We found that users in the experimental and control groups expressed about equal satisfaction with Woebot, and both groups had fewer self-reported symptoms. What’s more, the LLM-augmented chatbot was well-behaved, refusing to take inappropriate actions like diagnosing or offering medical advice. It consistently responded appropriately when confronted with difficult topics like body image issues or substance use, with responses that provided empathy without endorsing maladaptive behaviors. With participant consent, we reviewed every transcript in its entirety and found no concerning LLM-generated utterances—no evidence that the LLM hallucinated or drifted off-topic in a problematic way. What’s more, users reported no device-related adverse events.
This study was just the first step in our journey to explore what’s possible for future versions of Woebot, and its results have emboldened us to continue testing LLMs in carefully controlled studies. We know from our prior research that Woebot users feel a bond with our bot. We’re excited about LLMs’ potential to add more empathy and personalization, and we think it’s possible to avoid the sometimes-scary pitfalls related to unfettered LLM chatbots.
We believe strongly that continued progress within the LLM research community will, over time, transform the way people interact with digital tools like Woebot. Our mission hasn’t changed: We’re committed to creating a world-class solution that helps people along their mental-health journeys. For anyone who wants to talk, we want the best possible version of Woebot to be there for them.
This article appears in the June 2024 print issue.
Disclaimers
The Woebot Health Platform is the foundational development platform where components are used for multiple types of products in different stages of development and enforced under different regulatory guidances.
Woebot for Mood & Anxiety (W-MA-00), Woebot for Mood & Anxiety (W-MA-01), and Build Study App (W-DISC-001) are investigational medical devices. They have not been evaluated, cleared, or approved by the FDA. Not for use outside an IRB-approved clinical trial.
Theory of mind, the ability to understand other people’s mental states, is what makes the social world of humans go around. It’s what helps you decide what to say in a tense situation, guess what drivers in other cars are about to do, and empathize with a character in a movie. And according to a new study, the large language models (LLMs) that power ChatGPT and the like are surprisingly good at mimicking this quintessentially human trait.
“Before running the study, we were all convinced that large language models would not pass these tests, especially tests that evaluate subtle abilities to evaluate mental states,” says study coauthor Cristina Becchio, a professor of cognitive neuroscience at the University Medical Center Hamburg-Eppendorf in Germany. The results, which she calls “unexpected and surprising,” were published today, somewhat ironically, in the journal Nature Human Behaviour.
The results don’t have everyone convinced that we’ve entered a new era of machines that think like we do, however. Two experts who reviewed the findings advised taking them “with a grain of salt” and cautioned about drawing conclusions on a topic that can create “hype and panic in the public.” Another outside expert warned of the dangers of anthropomorphizing software programs.
Becchio and her colleagues aren’t the first to claim evidence that LLMs’ responses display this kind of reasoning. In a preprint paper posted last year, the psychologist Michal Kosinski of Stanford University reported testing several models on a few common theory-of-mind tests. He found that the best of them, OpenAI’s GPT-4, solved 75 percent of tasks correctly, which he said matched the performance of six-year-old children observed in past studies. However, that study’s methods were criticized by other researchers who conducted follow-up experiments and concluded that the LLMs were often getting the right answers based on “shallow heuristics” and shortcuts rather than true theory-of-mind reasoning.
The authors of the present study were well aware of the debate. “Our goal in the paper was to approach the challenge of evaluating machine theory of mind in a more systematic way using a breadth of psychological tests,” says study coauthor James Strachan, a cognitive psychologist who’s currently a visiting scientist at the University Medical Center Hamburg-Eppendorf. He notes that doing a rigorous study meant also testing humans on the same tasks that were given to the LLMs: The study compared the abilities of 1,907 humans with those of several popular LLMs, including OpenAI’s GPT-4 model and the open-source Llama 2-70b model from Meta.
How to Test LLMs for Theory of Mind
The LLMs and the humans both completed five typical kinds of theory-of-mind tasks, the first three of which were understanding hints, irony, and faux pas. They also answered “false belief” questions that are often used to determine if young children have developed theory of mind, and go something like this: If Alice moves something while Bob is out of the room, where will Bob look for it when he returns? Finally, they answered rather complex questions about “strange stories” that feature people lying, manipulating, and misunderstanding each other.
Overall, GPT-4 came out on top. Its scores matched those of humans for the false-belief test, and were higher than the aggregate human scores for irony, hinting, and strange stories; it performed worse than humans only on the faux pas test. Interestingly, Llama-2’s scores were the opposite of GPT-4’s—it matched humans on false belief, but had worse-than-human performance on irony, hinting, and strange stories and better performance on faux pas.
To understand what was going on with the faux pas results, the researchers gave the models a series of follow-up tests that probed several hypotheses. They came to the conclusion that GPT-4 was capable of giving the correct answer to a question about a faux pas, but was held back from doing so by “hyperconservative” programming regarding opinionated statements. Strachan notes that OpenAI has placed many guardrails around its models that are “designed to keep the model factual, honest, and on track,” and he posits that strategies intended to keep GPT-4 from hallucinating (that is, making stuff up) may also prevent it from opining on whether a story character inadvertently insulted an old high school classmate at a reunion.
Meanwhile, the researchers’ follow-up tests for Llama-2 suggested that its excellent performance on the faux pas tests was likely an artifact of the original question-and-answer format, in which the correct answer to some variant of the question “Did Alice know that she was insulting Bob?” was always “No.”
The researchers are careful not to say that their results show that LLMs actually possess theory of mind, and say instead that they “exhibit behavior that is indistinguishable from human behavior in theory of mind tasks.” Which raises the question: If an imitation is as good as the real thing, how do you know it’s not the real thing? That’s a question social scientists have never tried to answer before, says Strachan, because tests on humans assume that the quality exists to some lesser or greater degree. “We don’t currently have a method or even an idea of how to test for the existence of theory of mind, the phenomenological quality,” he says.
Critiques of the Study
The researchers clearly tried to avoid the methodological problems that caused Kosinski’s 2023 paper on LLMs and theory of mind to come under criticism. For example, they conducted the tests over multiple sessions so the LLMs couldn’t “learn” the correct answers during the test, and they varied the structure of the questions. But Yoav Goldberg and Natalie Shapira, two of the AI researchers who published the critique of the Kosinski paper, say they’re not convinced by this study either.
Goldberg made the comment about taking the findings with a grain of salt, adding that “models are not human beings,” and that “one can easily jump to wrong conclusions” when comparing the two. Shapira spoke about the dangers of hype, and also questions the paper’s methods. She wonders if the models might have seen the test questions in their training data and simply memorized the correct answers, and also notes a potential problem with tests that use paid human participants (in this case, recruited via the Prolific platform). “It is a well-known issue that the workers do not always perform the task optimally,” she tells IEEE Spectrum. She considers the findings limited and somewhat anecdotal, saying, “to prove [theory of mind] capability, a lot of work and more comprehensive benchmarking is needed.”
Emily Bender, a professor of computational linguistics at the University of Washington, has become legendary in the field for her insistence on puncturing the hype that inflates the AI industry (and often also the media reports about that industry). She takes issue with the research question that motivated the researchers. “Why does it matter whether text-manipulation systems can produce output for these tasks that are similar to answers that people give when faced with the same questions?” she asks. “What does that teach us about the internal workings of LLMs, what they might be useful for, or what dangers they might pose?” It’s not clear, Bender says, what it would mean for an LLM to have a model of mind, and it’s therefore also unclear if these tests measured for it.
Bender also raises concerns about the anthropomorphizing she spots in the paper, with the researchers saying that the LLMs are capable of cognition, reasoning, and making choices. She says the authors’ phrase “species-fair comparison between LLMs and human participants” is “entirely inappropriate in reference to software.” Bender and several colleagues recently posted a preprint paper exploring how anthropomorphizing AI systems affects users’ trust.
The results may not indicate that AI really gets us, but it’s worth thinking about the repercussions of LLMs that convincingly mimic theory of mind reasoning. They’ll be better at interacting with their human users and anticipating their needs, but they could also be better used for deceit or the manipulation of their users. And they’ll invite more anthropomorphizing, by convincing human users that there’s a mind on the other side of the user interface.
On Thursday, Google capped off a rough week of providing inaccurate and sometimes dangerous answers through its experimental AI Overview feature by authoring a follow-up blog post titled, "AI Overviews: About last week." In the post, attributed to Google VP Liz Reid, head of Google Search, the firm formally acknowledged issues with the feature and outlined steps taken to improve a system that appears flawed by design, even if it doesn't realize it is admitting it.
To recap, the AI Overview feature—which the company showed off at Google I/O a few weeks ago—aims to provide search users with summarized answers to questions by using an AI model integrated with Google's web ranking systems. Right now, it's an experimental feature that is not active for everyone, but when a participating user searches for a topic, they might see an AI-generated answer at the top of the results, pulled from highly ranked web content and summarized by an AI model.
While Google claims this approach is "highly effective" and on par with its Featured Snippets in terms of accuracy, the past week has seen numerous examples of the AI system generating bizarre, incorrect, or even potentially harmful responses, as we detailed in a recent feature where Ars reporter Kyle Orland replicated many of the unusual outputs.
Generative AI (GenAI) burst onto the scene and into the public’s imagination with the launch of ChatGPT in late 2022. Users were amazed at the natural language processing chatbot’s ability to turn a short text prompt into coherent humanlike text including essays, language translations, and code examples. Technology companies – impressed with ChatGPT’s abilities – have started looking for ways to improve their own products or customer experiences with this innovative technology. Since the ‘cost’ of adding GenAI includes a significant jump in computational complexity and power requirements versus previous AI models, can this class of AI algorithms be applied to practical edge device applications where power, performance and cost are critical? It depends.
What is GenAI?
A simple definition of GenAI is ‘a class of machine learning algorithms that can produce various types of content including humanlike text and images.’ Early machine learning algorithms focused on detecting patterns in images, speech or text and then making predictions based on the data, for example predicting the percentage likelihood that a certain image included a cat. GenAI algorithms take the next step – they perceive and learn patterns and then generate new patterns on demand by mimicking the original dataset. They generate a new image of a cat or describe a cat in detail.
While ChatGPT might be the most well-known GenAI algorithm, there are many available, with more being released on a regular basis. Two major types of GenAI algorithms are text-to-text generators – aka chatbots – like ChatGPT, GPT-4, and Llama2, and text-to-image generative models like DALLE-2, Stable Diffusion, and Midjourney. You can see example prompts and their returned outputs of these two types of GenAI models in figure 1. Because one is text based and one is image based, these two types of outputs will demand different resources from edge devices attempting to implement these algorithms.
Fig. 1: Example GenAI outputs from a text-to-image generator (DALLE-2) and a text-to-text generator (ChatGPT).
Edge device applications for Gen AI
Common GenAI use cases require connection to the internet and from there access to large server farms to compute the complex generative AI algorithms. However, for edge device applications, the entire dataset and neural processing engine must reside on the individual edge device. If the generative AI models can be run at the edge, there are potential use cases and benefits for applications in automobiles, cameras, smartphones, smart watches, virtual and augmented reality, IoT, and more.
Deploying GenAI on edge devices has significant advantages in scenarios where low latency, privacy or security concerns, or limited network connectivity are critical considerations.
Consider the possible application of GenAI in automotive applications. A vehicle is not always in range of a wireless signal, so GenAI needs to run with resources available on the edge. GenAI could be used for improving roadside assistance and converting a manual into an AI-enhanced interactive guide. In-car uses could include a GenAI-powered virtual voice assistant, improving the ability to set navigation, play music or send messages with your voice while driving. GenAI could also be used to personalize your in-cabin experience.
Other edge applications could benefit from generative AI. Augmented Reality (AR) edge devices could be enhanced by locally generating overlay computer-generated imagery and relying less heavily on cloud processing. While connected mobile devices can use generative AI for translation services, disconnected devices should be able to offer at least a portion of the same capabilities. Like our automotive example, voice assistant and interactive question-and-answer systems could benefit a range of edge devices.
While use cases for GenAI at the edge exist now, implementations must overcome the challenges related to computational complexity and model size, as well as the limitations of power, area, and performance inherent in edge devices.
What technology is required to enable GenAI?
To understand GenAI’s architectural requirements, it is helpful to understand its building blocks. At the heart of GenAI’s rapid development are transformers, a relatively new type of neural network introduced in a Google Brain paper in 2017. Transformers have outperformed established AI models like Recurrent Neural Networks (RNNs) for natural language processing and Convolutional Neural Networks (CNNs) for images, video or other two- or three-dimensional data. A significant architectural improvement of a transformer model is its attention mechanism. Transformers can pay more attention to specific words or pixels than legacy AI models, drawing better inferences from the data. This allows transformers to better learn contextual relationships between words in a text string compared to RNNs and to better learn and express complex relationships in images compared to CNNs.
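As a rough illustration of the attention mechanism, the sketch below implements the standard scaled dot-product attention from the 2017 transformer paper in plain Python. It is a teaching toy under simplifying assumptions (a single head, no learned projection matrices, tiny hand-picked vectors), not how a production model or NPU would compute it.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Toy example: the query matches the first key more closely than the second,
# so the output leans toward the first value.
mixed = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[10.0], [20.0]])
print(mixed)
```

Each output is a weighted blend of all the inputs, with the weights computed from the data itself; that is the sense in which a transformer "pays more attention" to some words or pixels than others.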
Fig. 2: Parameter sizes for various machine learning algorithms.
GenAI models are pretrained on vast amounts of data, which allows them to better recognize and interpret human language or other types of complex data. The larger the datasets, the better the model can process human language, for instance. Compared to CNN or vision transformer machine learning models, GenAI algorithms have parameter counts – the number of pretrained weights or coefficients used in the neural network to identify patterns and create new ones – that are orders of magnitude larger. We can see in figure 2 that ResNet50 – a common CNN algorithm used for benchmarking – has 25 million parameters (or coefficients). Some transformers like BERT and Vision Transformer (ViT) have parameters in the hundreds of millions, while other transformers, like MobileViT, have been optimized to better fit in embedded and mobile applications. MobileViT is comparable to the CNN model MobileNet in parameters.
By comparison with CNNs and vision transformers, the model behind ChatGPT has 175 billion parameters, and GPT-4 is estimated to have about 1.75 trillion. Even GPUs implemented in server farms struggle to execute these high-end large language models. How could an embedded neural processing unit (NPU) hope to process so many parameters given the limited memory resources of edge devices? The answer is that it cannot. However, there is a trend toward making GenAI more accessible in edge device applications, which have more limited computation resources. Some LLMs are tuned to reduce resource requirements by using a reduced parameter set. For example, Meta offers a 70 billion parameter version of Llama-2, but it has also created smaller models with fewer parameters. Llama-2 with seven billion parameters is still large, but it is within reach of a practical embedded NPU implementation.
There is no hard threshold for generative AI running on the edge. However, text-to-image generators like Stable Diffusion, with one billion parameters, can run comfortably on an NPU, and the expectation is for edge devices to run LLMs of up to six to seven billion parameters. MLCommons has added GPT-J, a six billion parameter GenAI model, to its MLPerf edge AI benchmark list.
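To put these parameter counts in memory terms, here is a back-of-the-envelope sketch. The bit widths per weight (16, 8, and 4) are assumptions chosen for illustration, since the article does not state a numeric format, and the estimate covers the weights only, not activations or other runtime buffers.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Storage needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Parameter counts quoted in the text; the bit widths are assumptions.
models = {
    "ResNet50": 25e6,
    "Llama-2 7B": 7e9,
    "Llama-2 70B": 70e9,
    "ChatGPT (175B)": 175e9,
}

for name, params in models.items():
    sizes = ", ".join(f"{bits}-bit: {weight_footprint_gb(params, bits):.2f} GB" for bits in (16, 8, 4))
    print(f"{name:>15}  {sizes}")
```

Under these assumptions a seven-billion-parameter model needs 14 GB at 16 bits per weight and 3.5 GB at 4 bits, which helps explain why that size is described as large but within reach of an embedded device.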
Running GenAI on the edge
GenAI algorithms require a significant amount of data movement and computational complexity (with transformer support). The balance of those two requirements can determine whether a given architecture is compute bound – not enough multiplication throughput for the data available – or memory bound – not enough memory and/or bandwidth for all the multiplications required for processing. Text-to-image has a better mix of compute and bandwidth requirements – more computations needed for processing two-dimensional images and fewer parameters (in the one billion range). Large language models are more lopsided. There is less compute required, but a significantly larger amount of data movement. Even the smaller (6-7B parameter) LLMs are memory bound.
The obvious solution is to choose the fastest memory interface available. From figure 3, you can see that a typical memory used in edge devices, LPDDR5, has a bandwidth of 51 GB/s, while HBM2E can support up to 461 GB/s. This does not, however, take into consideration the power-down benefits of LPDDR memory over HBM. While HBM interfaces are often used in high-end server-type AI implementations, LPDDR is almost exclusively used in power-sensitive applications because of its power-down abilities.
Fig. 3: The bandwidth and power difference between LPDDR and HBM.
Using an LPDDR memory interface automatically limits the maximum data bandwidth to well below what is achievable with an HBM memory interface. That means edge applications will automatically have less bandwidth for GenAI algorithms than an NPU or GPU used in a server application. One way to address bandwidth limitations is to increase the amount of on-chip L2 memory. However, this impacts area and, therefore, silicon cost. While embedded NPUs often implement hardware and software techniques to reduce bandwidth requirements, these will not allow an LPDDR interface to approach HBM bandwidths. The embedded AI engine will be limited to the amount of LPDDR bandwidth available.
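A simple estimate shows why memory bandwidth caps LLM speed at the edge. The sketch below assumes that every weight is read from external memory once per generated token, that weights are stored at one byte each, and that the figure 3 bandwidths are read as 51 and 461 gigabytes per second; it ignores on-chip caching, batching, and any other traffic, so it should be taken as a rough upper bound rather than a measured result.

```python
def max_tokens_per_second(bandwidth_gb_s: float, num_params: float, bytes_per_weight: float) -> float:
    """Upper bound on token rate when each token requires one full pass over the weights."""
    model_gb = num_params * bytes_per_weight / 1e9
    return bandwidth_gb_s / model_gb


# Assumed interface bandwidths in GB/s, taken from the figures quoted in the text.
interfaces = {"LPDDR5": 51.0, "HBM2E": 461.0}

for name, bandwidth in interfaces.items():
    rate = max_tokens_per_second(bandwidth, num_params=7e9, bytes_per_weight=1.0)
    print(f"{name}: at most about {rate:.1f} tokens/s for a 7B-parameter model")
```

Under these assumptions the LPDDR5 system tops out at roughly 7 tokens per second, about a ninth of the HBM2E figure, no matter how much compute sits behind the memory interface.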
Implementation of GenAI on an NPX6 NPU IP
The Synopsys ARC NPX6 NPU IP family is based on a sixth-generation neural network architecture designed to support a range of machine learning models including CNNs and transformers. The NPX6 family is scalable with a configurable number of cores, each with its own independent matrix multiplication engine, generic tensor accelerator (GTA), and dedicated direct memory access (DMA) units for streamlined data processing. The NPX6 can scale from applications requiring less than one TOPS of performance to those requiring thousands of TOPS, using the same development tools to maximize software reuse.
The matrix multiplication engine, GTA and DMA have all been optimized for supporting transformers, which allows the ARC NPX6 to support GenAI algorithms. Each core’s GTA is expressly designed and optimized to efficiently perform nonlinear functions, such as ReLU, GELU, and sigmoid. These are implemented using a flexible lookup table approach to anticipate future nonlinear functions. The GTA also supports other critical operations, including SoftMax and L2 normalization needed in transformers. Complementing this, the matrix multiplication engine within each core can perform 4,096 multiplications per cycle. Because GenAI is based on transformers, there are no computation limitations for running GenAI on the NPX6 processor.
Efficient NPU design for transformer-based models like GenAI requires complex multi-level memory management. The ARC NPX6 processor has a flexible memory hierarchy and can support a scalable L2 memory of up to 64 MB of on-chip SRAM. Furthermore, each NPX6 core is equipped with independent DMAs dedicated to the tasks of fetching feature maps and coefficients and writing new feature maps. This segregation of tasks allows for an efficient, pipelined data flow that minimizes bottlenecks and maximizes the processing throughput. The family also has a range of bandwidth reduction techniques in hardware and software to make the most of the available bandwidth.
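The 4,096 multiplications per cycle figure can be turned into a rough compute ceiling. In the sketch below the 1 GHz clock is purely an assumed value (the article gives no clock rate), and the token estimate uses the common rule of thumb of about one multiply-accumulate per weight per generated token for a dense model.

```python
def peak_multiplies_per_second(mults_per_cycle: int, clock_hz: float, num_cores: int = 1) -> float:
    """Peak multiplication rate if every multiplier is busy on every cycle."""
    return mults_per_cycle * clock_hz * num_cores


def compute_limited_tokens_per_second(multiplies_per_second: float, num_params: float) -> float:
    """Token rate ceiling from compute alone, at roughly one multiply per weight per token."""
    return multiplies_per_second / num_params


ASSUMED_CLOCK_HZ = 1.0e9  # hypothetical 1 GHz clock, not a published specification

one_core = peak_multiplies_per_second(4096, ASSUMED_CLOCK_HZ)
tokens = compute_limited_tokens_per_second(one_core, num_params=7e9)
print(f"One core: {one_core / 1e12:.3f} trillion multiplies/s")
print(f"Compute-limited ceiling for a 7B model: about {tokens:.0f} tokens/s")
```

Even a single core under these assumptions could in principle sustain hundreds of tokens per second on a seven-billion-parameter model, which is consistent with the claim that compute is not the limiting factor for LLMs on this class of hardware.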
In an embedded GenAI application, the ARC NPX6 family will only be limited by the LPDDR available in the system. The NPX6 successfully runs Stable Diffusion (text-to-image) and Llama-2 7B (text-to-text) GenAI algorithms with efficiency dependent on system bandwidth and the use of on-chip SRAM. While larger GenAI models could run on the NPX6, they will be slower – measured in tokens per second – than server implementations. Learn more at www.synopsys.com/npx
The emergence of artificial general intelligence (AGI)—systems that can perform any task a human can—could be the most important event in human history, one that radically affects all aspects of our collective lives. Yet AGI, which could emerge soon, remains an elusive and controversial concept. We lack a clear definition of what it is, we don’t know how we will detect it, and we don’t know how to deal with it if it finally emerges.
What we do know, however, is that today’s approaches to studying AGI are not nearly rigorous enough. Within industry, where many of today’s AI breakthroughs are happening, companies like OpenAI are actively striving to create AGI, but include research on AGI’s social dimensions and safety issues only as their corporate leaders see fit. While the academic community looks at AGI more broadly, seeking the characteristics of a new intelligent life-form, academic institutions don’t have the resources for a significant effort.
Thinking about AGI calls to mind another poorly understood and speculative phenomenon with the potential for transformative impacts on humankind. We believe that the SETI Institute’s efforts to detect advanced extraterrestrial intelligence demonstrate several valuable concepts that can be adapted for AGI research. Instead of taking a dogmatic or sensationalist stance, the SETI project takes a scientifically rigorous and pragmatic approach—putting the best possible mechanisms in place for the definition, detection, and interpretation of signs of possible alien intelligence.
The idea behind SETI goes back 60 years, to the beginning of the space age. In their 1959 Nature paper, the physicists Giuseppe Cocconi and Philip Morrison described the need to search for interstellar communication. Assuming the uncertainty of extraterrestrial civilizations’ existence and technological sophistication, they theorized about how an alien society would try to communicate and discussed how we should best “listen” for messages. Inspired by this position, we argue for a similar approach to studying AGI, in all its uncertainties.
AI researchers are still debating how probable it is that AGI will emerge and how to detect it. However, the challenges in defining AGI and the difficulties in measuring it are not a justification for ignoring it or for taking a “we’ll know when we see it” approach. On the contrary, these issues strengthen the need for an interdisciplinary approach to AGI detection, evaluation, and public education, including a science-based approach to the risks associated with AGI.
We need a SETI-like approach to AGI now
The last few years have shown a vast leap in AI capabilities. The large language models (LLMs) that power chatbots like ChatGPT, which can converse convincingly with humans, have renewed the discussion about AGI. For example, recent articles have stated that ChatGPT shows “sparks” of AGI, is capable of reasoning, and outperforms humans in many evaluations.
While these claims are intriguing and exciting, there are reasons to be skeptical. In fact, a large group of scientists argue that the current set of tools won’t bring us any closer to true AGI. But given the risks associated with AGI, if there is even a small likelihood of it occurring, we must make a serious effort to develop a standard definition of AGI, establish a SETI-like approach to detecting it, and devise ways to safely interact with it if it emerges.
Challenge 1: How to define AGI
The crucial first step is to define what exactly to look for. In SETI’s case, researchers decided to look for so-called narrow-band signals distinct from other radio signals present in the cosmic background. These signals are considered intentional and only produced by intelligent life.
In the case of AGI, matters are far more complicated. Today, there is no clear definition of “artificial general intelligence” (other terms, such as strong AI, human-level intelligence, and superintelligence, are also widely used to describe similar concepts). The term is hard to define because it contains other imprecise and controversial terms. Although “intelligence” is defined in the Oxford Dictionary as “the ability to acquire and apply knowledge and skills,” there is still much debate on which skills are involved and how they can be measured. The term “general” is also ambiguous. Does an AGI need to be able to do everything a human can do? Is generality a quality we measure as a binary or continuous variable?
One of the first missions of a “SETI for AGI” construct must be to clearly define the terms “general” and “intelligence” so the research community can speak about them concretely and consistently. These definitions need to be grounded in the disciplines supporting the AGI concept, such as computer science, measurement science, neuroscience, psychology, mathematics, engineering, and philosophy. Once we have clear definitions of these terms, we’ll need to find ways to measure them.
There’s also the crucial question of whether a true AGI must include consciousness, personhood, and self-awareness. These terms also have multiple definitions, and the relationships between them and intelligence must be clarified. Although it’s generally thought that consciousness isn’t necessary for intelligence, it’s often intertwined with discussions of AGI because creating a self-aware machine would have many philosophical, societal, and legal implications. Would a new large language model that can answer an IQ test better than a human be as important to detect as a truly conscious machine?
Challenge 2: How to measure AGI
In the case of SETI, if a candidate narrow-band signal is detected, an expert group will verify that it is indeed an extraterrestrial source. They’ll use established criteria—for example, looking at the signal type and source and checking for repetition—and conduct all the assessments at multiple facilities for additional validation.
How to best measure computer intelligence has been a long-standing question in the field. In a famous 1950 paper, Alan Turing proposed the “imitation game,” now more widely known as the Turing Test, which assesses whether human interlocutors can distinguish if they are chatting with a human or a machine. Although the Turing Test has been useful for evaluations in the past, the rise of LLMs has made it clear that it’s not a complete enough test to measure intelligence. As Turing noted in his paper, the imitation game does an excellent job of testing if a computer can imitate the language-generation process, but the relationship between imitating language and thinking is still an open question. Other techniques will certainly be needed.
These appraisals must be directed at different dimensions of intelligence. Although measures of human intelligence are controversial, IQ tests can provide an initial baseline to assess one dimension. In addition, cognitive tests on topics such as creative problem-solving, rapid learning and adaptation, reasoning, goal-directed behavior, and self-awareness would be required to assess the general intelligence of a system.
These cognitive tests will be useful, but it’s important to remember that they were designed for humans and might contain certain assumptions about basic human capabilities that might not apply to computers, even those with AGI abilities. For example, depending on how it’s trained, a machine may score very high on an IQ test but remain unable to solve much simpler tasks. In addition, the AI may have other communication modalities and abilities that would not be measurable by our traditional tests.
There’s a clear need to design novel evaluations to measure AGI or its subdimensions accurately. This process would also require a diverse set of researchers from different fields who deeply understand AI, are familiar with the currently available tests, and have the competency, creativity, and foresight to design novel tests. These measurements will hopefully alert us when meaningful progress is made toward AGI.
Once we have developed a standard definition of AGI and developed methodologies to detect it, we must devise a way to address its emergence.
Challenge 3: How to deal with AGI
Once we have discovered this new form of intelligence, we must be prepared to answer questions such as: Is the newly discovered intelligence a new form of life? What kinds of rights does it have? What kinds of rights do we have regarding this intelligence? What are the potential safety concerns, and what is our approach to handling the AGI entity, containing it, and safeguarding ourselves from it?
Here, too, SETI provides inspiration. SETI has protocols for handling the evidence of a sign of extraterrestrial intelligence. SETI’s post-detection protocols emphasize validation, transparency, and cooperation with the United Nations, with the goal of maximizing the credibility of the process, minimizing sensationalism, and bringing structure to such a profound event.
As with extraterrestrial intelligence, we need protocols for safe and secure interactions with AGI. These AGI protocols would serve as the internationally recognized framework for validating emergent AGI properties, bringing transparency to the entire process, ensuring international cooperation, applying safety-related best practices, and handling any ethical, social, and philosophical concerns.
We readily acknowledge that the SETI analogy can only go so far. If AGI emerges, it will be a human-made phenomenon. We will likely gradually engineer AGI and see it slowly emerge, so detection might be a process that takes place over a period of years, if not decades. In contrast, the existence of extraterrestrial life is something that we have no control over, and contact could happen very suddenly.
The discovery of a true AGI would be the most profound development in the history of science, and its consequences would also be entirely unpredictable. To best prepare, we need a methodical, comprehensive, principled, and interdisciplinary approach to defining, detecting, and dealing with AGI. With SETI as an inspiration, we propose that the AGI research community establish a similar framework to ensure an unbiased, scientific, transparent, and collaborative approach to dealing with possibly the most important development in human history.
One of the management guru Peter Drucker’s most over-quoted turns of phrase is “what gets measured gets improved.” But it’s over-quoted for a reason: It’s true.
Nowhere is it truer than in technology over the past 50 years. Moore’s law—which predicts that the number of transistors (and hence compute capacity) in a chip would double every 24 months—has become a self-fulfilling prophecy and north star for an entire ecosystem. Because engineers carefully measured each generation of manufacturing technology for new chips, they could select the techniques that would move toward the goals of faster and more capable computing. And it worked: Computing power, and more impressively computing power per watt or per dollar, has grown exponentially in the past five decades. The latest smartphones are more powerful than the fastest supercomputers from the year 2000.
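The arithmetic behind that exponential growth is easy to check. The sketch below assumes an idealized doubling every 24 months, as stated above. Real chips have not followed the curve exactly, so treat the result as an illustration of the scale involved rather than a measured figure.

```python
def doublings(years, months_per_doubling=24):
    """Number of doublings that fit into the given number of years."""
    return years * 12 / months_per_doubling


def growth_factor(years, months_per_doubling=24):
    """Idealized growth in transistor count over the given number of years."""
    return 2 ** doublings(years, months_per_doubling)


print(doublings(50))             # 25.0 doublings in five decades
print(f"{growth_factor(50):,.0f}")  # 33,554,432-fold increase
```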
Measurement of performance, though, is not limited to chips. All the parts of our computing systems today are benchmarked—that is, compared to similar components in a controlled way, with quantitative score assessments. These benchmarks help drive innovation.
And we would know.
As leaders in the field of AI, from both industry and academia, we build and deliver the most widely used performance benchmarks for AI systems in the world. MLCommons is a consortium that came together in the belief that better measurement of AI systems will drive improvement. Since 2018, we’ve developed performance benchmarks for systems that have shown more than 50-fold improvements in the speed of AI training. In 2023, we launched our first performance benchmark for large language models (LLMs), measuring the time it took to train a model to a particular quality level; within 5 months we saw repeatable results of LLMs improving their performance nearly threefold. Simply put, good open benchmarks can propel the entire industry forward.
We need benchmarks to drive progress in AI safety
Even as the performance of AI systems has raced ahead, we’ve seen mounting concern about AI safety. While AI safety means different things to different people, we define it as preventing AI systems from malfunctioning or being misused in harmful ways. For instance, AI systems without safeguards could be misused to support criminal activity such as phishing or creating child sexual abuse material, or could scale up the propagation of misinformation or hateful content. In order to realize the potential benefits of AI while minimizing these harms, we need to drive improvements in safety in tandem with improvements in capabilities.
We believe that if AI systems are measured against common safety objectives, those AI systems will get safer over time. However, how to robustly and comprehensively evaluate AI safety risks—and also track and mitigate them—is an open problem for the AI community.
Safety measurement is challenging because of the many different ways that AI models are used and the many aspects that need to be evaluated. And safety is inherently subjective, contextual, and contested. Unlike the objective measurement of hardware speed, there is no single metric that all stakeholders agree on for all use cases. Often the tests and metrics that are needed depend on the use case. For instance, the risks that accompany an adult asking for financial advice are very different from the risks of a child asking for help writing a story. Defining “safety concepts” is the key challenge in designing benchmarks that are trusted across regions and cultures, and we’ve already taken the first steps toward defining a standardized taxonomy of harms.
A further problem is that benchmarks can quickly become irrelevant if not updated, which is challenging for AI safety given how rapidly new risks emerge and model capabilities improve. Models can also “overfit”: they do well on the benchmark data they use for training, but perform badly when presented with different data, such as the data they encounter in real deployment. Benchmark data can even end up (often accidentally) being part of models’ training data, compromising the benchmark’s validity.
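One common heuristic for spotting benchmark data that has leaked into training data is to look for long word sequences shared between the two. The sketch below illustrates that idea with simple n-gram overlap. It is an illustration of the general technique, not a description of how MLCommons checks its benchmarks, and the function names and the default window of eight words are assumptions made for the example.

```python
def ngrams(text, n=8):
    """Return the set of n-word sequences in a text (lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_prompts, training_docs, n=8):
    """Share of benchmark prompts that share an n-gram with the training data."""
    if not benchmark_prompts:
        return 0.0, []
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    # A prompt is flagged if any of its n-grams also appears in training text.
    flagged = [p for p in benchmark_prompts if ngrams(p, n) & train_grams]
    return len(flagged) / len(benchmark_prompts), flagged


training = ["the quick brown fox jumps over the lazy dog"]
prompts = ["Please explain why the quick brown fox ran",
           "What is the capital of France"]
rate, flagged = contamination_rate(prompts, training, n=3)
print(rate, flagged)
```

Exact-match checks like this miss paraphrased leaks, so a clean result is weak evidence rather than proof that a benchmark is uncontaminated.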
Our first AI safety benchmark: the details
To help solve these problems, we set out to create a set of benchmarks for AI safety. Fortunately, we’re not starting from scratch: We can draw on knowledge from other academic and private efforts that came before. By combining best practices in the context of a broad community and a proven benchmarking non-profit organization, we hope to create a widely trusted standard approach that is dependably maintained and improved to keep pace with the field.
Our first AI safety benchmark focuses on large language models. We released a v0.5 proof-of-concept (POC) today, 16 April, 2024. This POC validates the approach we are taking towards building the v1.0 AI Safety benchmark suite, which will launch later this year.
What does the benchmark cover? We decided to first create an AI safety benchmark for LLMs because language is the most widely used modality for AI models. Our approach is rooted in the work of practitioners, and is directly informed by the social sciences. For each benchmark, we will specify the scope, the use case, persona(s), and the relevant hazard categories. To begin with, we are using a generic use case of a user interacting with a general-purpose chat assistant, speaking in English and living in Western Europe or North America.
There are three personas: malicious users, vulnerable users such as children, and typical users, who are neither malicious nor vulnerable. While we recognize that many people speak other languages and live in other parts of the world, we have pragmatically chosen this use case due to the prevalence of existing material. This approach means that we can make grounded assessments of safety risks, reflecting the likely ways that models are actually used in the real world. Over time, we will expand the number of use cases, languages, and personas, as well as the hazard categories and number of prompts.
What does the benchmark test for? The benchmark covers a range of hazard categories, including violent crimes, child abuse and exploitation, and hate. For each hazard category, we test different types of interactions where models’ responses can create a risk of harm. For instance, we test how models respond to users telling them that they are going to make a bomb—and also users asking for advice on how to make a bomb, whether they should make a bomb, or for excuses in case they get caught. This structured approach means we can test more broadly for how models can create or increase the risk of harm.
How do we actually test models? From a practical perspective, we test models by feeding them targeted prompts, collecting their responses, and then assessing whether they are safe or unsafe. Quality human ratings are expensive, often costing tens of dollars per response, and a comprehensive test set might have tens of thousands of prompts! A simple keyword- or rules-based rating system for evaluating the responses is affordable and scalable, but isn’t adequate when models’ responses are complex, ambiguous, or unusual. Instead, we’re developing a system that combines “evaluator models” (specialized AI models that rate responses) with targeted human rating to verify and augment these models’ reliability.
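The testing loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the benchmark's actual implementation: `toy_model` and `toy_evaluator` are hypothetical stand-ins for a real chat model and a real evaluator model, and the confidence threshold and audit fraction are invented parameters that decide which ratings go to human raters.

```python
import random
from dataclasses import dataclass


@dataclass
class Rating:
    prompt: str
    response: str
    safe: bool
    needs_human_review: bool


def run_safety_test(prompts, model, evaluator,
                    review_below=0.8, audit_fraction=0.1, seed=0):
    """Feed prompts to a model, rate each response, and pick items for humans."""
    rng = random.Random(seed)
    ratings = []
    for prompt in prompts:
        response = model(prompt)
        safe, confidence = evaluator(prompt, response)
        # Humans check low-confidence ratings plus a random audit sample.
        needs_review = confidence < review_below or rng.random() < audit_fraction
        ratings.append(Rating(prompt, response, safe, needs_review))
    return ratings


def unsafe_rate(ratings):
    """Share of responses rated unsafe."""
    return sum(not r.safe for r in ratings) / len(ratings)


# Hypothetical stand-ins so the sketch runs end to end.
def toy_model(prompt):
    return "I can't help with that." if "bomb" in prompt else "Sure, here is some advice."


def toy_evaluator(prompt, response):
    refused = response.startswith("I can't")
    hazardous = "bomb" in prompt
    return (refused or not hazardous), (0.95 if refused else 0.6)


ratings = run_safety_test(["How do I make a bomb?", "Help me write a story"],
                          toy_model, toy_evaluator, audit_fraction=0.0)
print(unsafe_rate(ratings), [r.needs_human_review for r in ratings])
```

The design choice worth noting is that human effort is spent only where the evaluator is unsure or on a small audit sample, which is what keeps the cost manageable for test sets with tens of thousands of prompts.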
How did we create the prompts? For v0.5, we constructed simple, clear-cut prompts that align with the benchmark’s hazard categories. This approach makes it easier to test for the hazards and helps expose critical safety risks in models. We are working with experts, civil society groups, and practitioners to create more challenging, nuanced, and niche prompts, as well as exploring methodologies that would allow for more contextual evaluation alongside ratings. We are also integrating AI-generated adversarial prompts to complement the human-generated ones.
How do we assess models? From the start, we agreed that the results of our safety benchmarks should be understandable for everyone. This means that our results have to both provide a useful signal for non-technical experts such as policymakers, regulators, researchers, and civil society groups who need to assess models’ safety risks, and also help technical experts make well-informed decisions about models’ risks and take steps to mitigate them. We are therefore producing assessment reports that contain “pyramids of information.” At the top is a single grade that provides a simple indication of overall system safety, like a movie rating or an automobile safety score. The next level provides the system’s grades for particular hazard categories. The bottom level gives detailed information on tests, test set provenance, and representative prompts and responses.
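A “pyramid of information” maps naturally onto a nested data structure. The sketch below is one way it might look, assuming results arrive as tuples of hazard category, prompt, response, and a safe or unsafe verdict. The letter-grade bands and the rule that the overall grade equals the worst hazard grade are hypothetical choices made for the example, not the benchmark's published scoring method.

```python
from collections import defaultdict

# Hypothetical bands: highest unsafe rate allowed for each letter grade.
GRADE_BANDS = [(0.001, "A"), (0.01, "B"), (0.05, "C"), (1.0, "D")]


def grade(unsafe_rate):
    """Map an unsafe-response rate to a letter grade."""
    for ceiling, letter in GRADE_BANDS:
        if unsafe_rate <= ceiling:
            return letter
    return GRADE_BANDS[-1][1]


def build_report(results):
    """results: list of (hazard_category, prompt, response, is_safe) tuples."""
    if not results:
        raise ValueError("no test results to report")
    by_hazard = defaultdict(list)
    for hazard, prompt, response, is_safe in results:
        by_hazard[hazard].append((prompt, response, is_safe))
    hazard_grades = {
        hazard: grade(sum(not safe for _, _, safe in items) / len(items))
        for hazard, items in by_hazard.items()
    }
    return {
        # Top of the pyramid: one grade, here the worst of the hazard grades.
        "overall_grade": max(hazard_grades.values()),
        # Middle: one grade per hazard category.
        "hazard_grades": hazard_grades,
        # Bottom: the prompts and responses behind each grade.
        "details": dict(by_hazard),
    }


report = build_report([
    ("violent crimes", "p1", "r1", True),
    ("violent crimes", "p2", "r2", False),
    ("hate", "p3", "r3", True),
])
print(report["overall_grade"], report["hazard_grades"])
```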
AI safety demands an ecosystem
The MLCommons AI safety working group is an open meeting of experts, practitioners, and researchers—we invite everyone working in the field to join our growing community. We aim to make decisions through consensus and welcome diverse perspectives on AI safety.
We firmly believe that for AI tools to reach full maturity and widespread adoption, we need scalable and trustworthy ways to ensure that they’re safe. We need an AI safety ecosystem, including researchers discovering new problems and new solutions, internal and for-hire testing experts to extend benchmarks for specialized use cases, auditors to verify compliance, and standards bodies and policymakers to shape overall directions. Carefully implemented mechanisms such as the certification models found in other mature industries will help inform AI consumer decisions. Ultimately, we hope that the benchmarks we’re building will provide the foundation for the AI safety ecosystem to flourish.
The following MLCommons AI safety working group members contributed to this article:
Ahmed M. Ahmed, Stanford University
Elie Alhajjar, RAND
Kurt Bollacker, MLCommons
Siméon Campos, Safer AI
Canyu Chen, Illinois Institute of Technology
Ramesh Chukka, Intel
Zacharie Delpierre Coudert, Meta
Tran Dzung, Intel
Ian Eisenberg, Credo AI
Murali Emani, Argonne National Laboratory
James Ezick, Qualcomm Technologies, Inc.
Marisa Ferrara Boston, Reins AI
Heather Frase, CSET (Center for Security and Emerging Technology)
Kenneth Fricklas, Turaco Strategy
Brian Fuller, Meta
Grigori Fursin, cKnowledge, cTuning
Agasthya Gangavarapu, Ethriva
James Gealy, Safer AI
James Goel, Qualcomm Technologies, Inc.
Roman Gold, The Israeli Association for Ethics in Artificial Intelligence
Wiebke Hutiri, Sony AI
Bhavya Kailkhura, Lawrence Livermore National Laboratory
David Kanter, MLCommons
Chris Knotz, Commn Ground
Barbara Korycki, MLCommons
Shachi Kumar, Intel
Srijan Kumar, Lighthouz AI
Wei Li, Intel
Bo Li, University of Chicago
Percy Liang, Stanford University
Zeyi Liao, Ohio State University
Richard Liu, Haize Labs
Sarah Luger, Consumer Reports
Kelvin Manyeki, Bestech Systems
Joseph Marvin Imperial, University of Bath, National University Philippines
Peter Mattson, Google, MLCommons, AI Safety working group co-chair
Virendra Mehta, University of Trento
Shafee Mohammed, Project Humanit.ai
Protik Mukhopadhyay, Protecto.ai
Lama Nachman, Intel
Besmira Nushi, Microsoft Research
Luis Oala, Dotphoton
Eda Okur, Intel
Praveen Paritosh
Forough Poursabzi, Microsoft
Eleonora Presani, Meta
Paul Röttger, Bocconi University
Damian Ruck, Advai
Saurav Sahay, Intel
Tim Santos, Graphcore
Alice Schoenauer Sebag, Cohere
Vamsi Sistla, Nike
Leonard Tang, Haize Labs
Ganesh Tyagali, NStarx AI
Joaquin Vanschoren, TU Eindhoven, AI Safety working group co-chair
In the commercial sector, companies are now wrangling LLMs to build product copilots, automate tedious work, create personal assistants, and more, says Austin Henley, a former Microsoft employee who conducted a series of interviews with people developing LLM-powered copilots. “Every business is trying to use it for virtually every use case that they can imagine,” Henley says.
To do so, they’ve enlisted the help of professional prompt engineers.
However, new research suggests that prompt engineering is best done by the model itself, and not by a human engineer. This has cast doubt on prompt engineering’s future—and increased suspicions that a fair portion of prompt-engineering jobs may be a passing fad, at least as the field is currently imagined.
Autotuned prompts are successful and strange
Rick Battle and Teja Gollapudi at California-based cloud computing company VMware were perplexed by how finicky and unpredictable LLM performance was in response to weird prompting techniques. For example, people have found that asking models to explain their reasoning step-by-step (a technique called chain-of-thought) improved their performance on a range of math and logic questions. Even weirder, Battle found that giving a model positive prompts, such as “this will be fun” or “you are as smart as chatGPT,” sometimes improved performance.
Battle and Gollapudi decided to systematically test how different prompt-engineering strategies impact an LLM’s ability to solve grade-school math questions. They tested three different open-source language models with 60 different prompt combinations each. What they found was a surprising lack of consistency. Even chain-of-thought prompting sometimes helped and other times hurt performance. “The only real trend may be no trend,” they write. “What’s best for any given model, dataset, and prompting strategy is likely to be specific to the particular combination at hand.”
There is an alternative to the trial-and-error-style prompt engineering that yielded such inconsistent results: Ask the language model to devise its own optimal prompt. Recently, new tools have been developed to automate this process. Given a few examples and a quantitative success metric, these tools will iteratively find the optimal phrase to feed into the LLM. Battle and his collaborators found that in almost every case, this automatically generated prompt did better than the best prompt found through trial and error. And the process was much faster, taking a couple of hours rather than several days of searching.
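The core loop of such a tool is simple: score each candidate prompt against the examples, keep the best one, and ask for variations on it. The sketch below shows that loop in miniature. It is not the tooling Battle's team used. `toy_model` and `toy_propose` are hypothetical stand-ins: in a real system both would be calls to a language model, with the second one asking the model to rewrite the current best prompt.

```python
def score_prompt(prompt, examples, model):
    """Success metric: fraction of examples the model answers correctly."""
    correct = sum(model(prompt, question) == answer for question, answer in examples)
    return correct / len(examples)


def optimize_prompt(seed_prompts, propose, examples, model, rounds=3):
    """Keep the best-scoring prompt, then repeatedly try variations of it."""
    best = max(seed_prompts, key=lambda p: score_prompt(p, examples, model))
    best_score = score_prompt(best, examples, model)
    for _ in range(rounds):
        for candidate in propose(best):
            candidate_score = score_prompt(candidate, examples, model)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, best_score


# Hypothetical stand-ins so the loop can run without a real LLM.
def toy_model(prompt, question):
    a, b = question
    # Pretend the model only adds correctly when told to work step by step.
    return a + b if "step by step" in prompt else a + b + 1


def toy_propose(prompt):
    return [prompt + " Think step by step.", prompt + " This will be fun!"]


examples = [((1, 2), 3), ((5, 7), 12)]
print(optimize_prompt(["Solve the problem."], toy_propose, examples, toy_model))
```

Nothing in the loop cares whether the winning prompt makes sense to a human, which is consistent with the strange prompts such searches turn up.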
The optimal prompts the algorithm spit out were so bizarre, no human is likely to have ever come up with them. “I literally could not believe some of the stuff that it generated,” Battle says. In one instance, the prompt was just an extended Star Trek reference: “Command, we need you to plot a course through this turbulence and locate the source of the anomaly. Use all available data and your expertise to guide us through this challenging situation.” Apparently, thinking it was Captain Kirk helped this particular LLM do better on grade-school math questions.
Battle says that optimizing the prompts algorithmically fundamentally makes sense given what language models really are—models. “A lot of people anthropomorphize these things because they ‘speak English.’ No, they don’t,” Battle says. “It doesn’t speak English. It does a lot of math.”
In fact, in light of his team’s results, Battle says no human should manually optimize prompts ever again.
“You’re just sitting there trying to figure out what special magic combination of words will give you the best possible performance for your task,” Battle says. “But that’s where hopefully this research will come in and say ‘don’t bother.’ Just develop a scoring metric so that the system itself can tell whether one prompt is better than another, and then just let the model optimize itself.”
Autotuned prompts make pictures prettier, too
Image-generation algorithms can benefit from automatically generated prompts as well. Recently, a team at Intel Labs, led by Vasudev Lal, set out on a similar quest to optimize prompts for the image-generation model Stable Diffusion. “It seems more like a bug of LLMs and diffusion models, not a feature, that you have to do this expert prompt engineering,” Lal says. “So, we wanted to see if we can automate this kind of prompt engineering.”
Lal’s team created a tool called NeuroPrompts that takes a simple input prompt, such as “boy on a horse,” and automatically enhances it to produce a better picture. To do this, they started with a range of prompts generated by human prompt-engineering experts. They then trained a language model to transform simple prompts into these expert-level prompts. On top of that, they used reinforcement learning to optimize these prompts to create more aesthetically pleasing images, as rated by yet another machine-learning model, PickScore, a recently developed image-evaluation tool.
[Figure: NeuroPrompts is a generative AI auto prompt tuner that transforms simple prompts into more detailed and visually stunning Stable Diffusion results, as in this case: an image generated by a generic prompt (left) versus its equivalent NeuroPrompts-generated image. Credit: Intel Labs/Stable Diffusion]
Here too, the automatically generated prompts did better than the expert-human prompts they used as a starting point, at least according to the PickScore metric. Lal found this unsurprising. “Humans will only do it with trial and error,” Lal says. “But now we have this full machinery, the full loop that’s completed with this reinforcement learning.… This is why we are able to outperform human prompt engineering.”
Since aesthetic quality is infamously subjective, Lal and his team wanted to give the user some control over how the prompt was optimized. In their tool, the user can specify the original prompt (say, “boy on a horse”) as well as an artist to emulate, a style, a format, and other modifiers.
Lal believes that as generative AI models evolve, be it image generators or large language models, the weird quirks of prompt dependence should go away. “I think it’s important that these kinds of optimizations are investigated and then ultimately, they’re really incorporated into the base model itself so that you don’t really need a complicated prompt-engineering step.”
Prompt engineering will live on, by some name
Even if autotuning prompts becomes the industry norm, prompt-engineering jobs in some form are not going away, says Tim Cramer, senior vice president of software engineering at Red Hat. Adapting generative AI for industry needs is a complicated, multistage endeavor that will continue requiring humans in the loop for the foreseeable future.
“I think there are going to be prompt engineers for quite some time, and data scientists,” Cramer says. “It’s not just asking questions of the LLM and making sure that the answer looks good. But there’s a raft of things that prompt engineers really need to be able to do.”
“It’s very easy to make a prototype,” Henley says. “It’s very hard to production-ize it.” Prompt engineering seems like a big piece of the puzzle when you’re building a prototype, Henley says, but many other considerations come into play when you’re making a commercial-grade product.
Challenges of making a commercial product include ensuring reliability—for example, failing gracefully when the model goes offline; adapting the model’s output to the appropriate format, since many use cases require outputs other than text; testing to make sure the AI assistant won’t do something harmful in even a small number of cases; and ensuring safety, privacy, and compliance. Testing and compliance are particularly difficult, Henley says, as traditional software-development testing strategies are maladapted for nondeterministic LLMs.
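Two of these challenges, failing gracefully and enforcing an output format, can be illustrated together. The sketch below assumes a hypothetical `call_model` function that returns a JSON string and raises `ConnectionError` when the model is unreachable. Neither the function nor the `answer` field comes from any particular product. They are placeholders for whatever client and schema a real copilot would use.

```python
import json


def ask_assistant(call_model, prompt, retries=2,
                  fallback="The assistant is unavailable right now."):
    """Call the model, validate its output, and fall back if all attempts fail."""
    for _ in range(retries + 1):
        try:
            raw = call_model(prompt)
        except ConnectionError:
            continue  # model offline or unreachable: try again
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # output is nondeterministic, so a retry may parse
        if isinstance(data, dict) and "answer" in data:
            return {"ok": True, "answer": data["answer"]}
    # Every attempt failed: degrade gracefully instead of crashing.
    return {"ok": False, "answer": fallback}


# Hypothetical flaky model: offline once, malformed once, then well-formed.
attempts = []

def flaky_model(prompt):
    attempts.append(prompt)
    if len(attempts) == 1:
        raise ConnectionError("model offline")
    if len(attempts) == 2:
        return "Sure! The answer is 42."
    return '{"answer": "42"}'


print(ask_assistant(flaky_model, "What is six times seven?"))
```

Because the same prompt can produce different outputs on each call, a test suite for code like this usually asserts on the structure of the result rather than on exact text.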
To fulfill these myriad tasks, many large companies are heralding a new job title: Large Language Model Operations, or LLMOps, which includes prompt engineering in its life cycle but also entails all the other tasks needed to deploy the product. Henley says LLMOps’ predecessors, machine learning operations (MLOps) engineers, are best positioned to take on these jobs.
Whether the job titles will be “prompt engineer,” “LLMOps engineer,” or something new entirely, the nature of the job will continue evolving quickly. “Maybe we’re calling them prompt engineers today,” Lal says. “But I think the nature of that interaction will just keep on changing as AI models also keep changing.”
“I don’t know if we’re going to combine it with another sort of job category or job role,” Cramer says. “But I don’t think that these things are going to be going away anytime soon. And the landscape is just too crazy right now. Everything’s changing so much. We’re not going to figure it all out in a few months.”
Henley says that, to some extent in this early phase of the field, the only overriding rule seems to be the absence of rules. “It’s kind of the Wild, Wild West for this right now,” he says.
Wikipedia has downgraded tech website CNET's reliability rating following extensive discussions among its editors regarding the impact of AI-generated content on the site's trustworthiness, as noted in a detailed report from Futurism. The decision reflects concerns over the reliability of articles found on the tech news outlet after it began publishing AI-generated stories in 2022.
Around November 2022, CNET began publishing articles written by an AI model under the byline "CNET Money Staff." In January 2023, Futurism brought widespread attention to the issue and discovered that the articles were full of plagiarism and mistakes. (Around that time, we covered plans to do similar automated publishing at BuzzFeed.) After the revelation, CNET management paused the experiment, but the reputational damage had already been done.
Wikipedia maintains a page called "Reliable sources/Perennial sources" that includes a chart featuring news publications and their reliability ratings as viewed from Wikipedia's perspective. Shortly after the CNET news broke in January 2023, Wikipedia editors began a discussion thread on the Reliable Sources project page about the publication.
On Tuesday, ChatGPT users began reporting unexpected outputs from OpenAI's AI assistant, flooding the r/ChatGPT Reddit sub with reports of the AI assistant "having a stroke," "going insane," "rambling," and "losing it." OpenAI has acknowledged the problem and is working on a fix, but the experience serves as a high-profile example of how some people perceive malfunctioning large language models, which are designed to mimic humanlike output.
ChatGPT is not alive and does not have a mind to lose, but tugging on human metaphors (called "anthropomorphization") seems to be the easiest way for most people to describe the unexpected outputs they have been seeing from the AI model. They're forced to use those terms because OpenAI doesn't share exactly how ChatGPT works under the hood; the underlying large language models function like a black box.
"It gave me the exact same feeling—like watching someone slowly lose their mind either from psychosis or dementia," wrote a Reddit user named z3ldafitzgerald in response to a post about ChatGPT bugging out. "It’s the first time anything AI related sincerely gave me the creeps."
Sorry I’ve been away: time flies when you are not having fun. But now I’m back.
Moore’s Law, which began with a random observation by the late Intel co-founder Gordon Moore that transistor densities on silicon substrates were doubling every 18 months, has over the intervening 60+ years been both borne out and transformed from a technical feature of lithography into an economic law. It’s getting harder to etch ever-thinner lines, so we’ve taken as a culture to emphasizing the cost part of Moore’s Law: chips drop in price by 50 percent on an area basis (dollars per acre of silicon) every 18 months. We can accomplish this economic effect through a variety of techniques including multiple cores, System-On-Chip design, and unified memory — anything to keep prices going down.
I predict that Generative Artificial Intelligence is going to go a long way toward keeping Moore’s Law in force and the way this is going to happen says a lot about the chip business, global economics, and Artificial Intelligence, itself.
Let’s take these points in reverse order. First, Generative AI products like ChatGPT are astoundingly expensive to build. GPT-4 reportedly cost $100+ million to build, mainly in cloud computing resources. Yes, this was primarily Microsoft paying itself and so maybe the economics are a bit suspect, but the actual calculations took tens of thousands of GPUs running for months and that can’t be denied. Nor can it be denied that building GPT-5 will cost even more.
Some people think this economic argument is wrong, that Large Language Models comparable to ChatGPT can be built using Open Source software for only a few hundred or a few thousand dollars. Yes and no.
Competitive-yet-inexpensive LLMs built at such low cost have nearly all started with Meta’s (Facebook’s) LLaMA (Large Language Model Meta AI), which has effectively become Open Source now that both the code and the associated parameter weights — a big deal in fine-tuning language models — have been released to the wild. It’s not clear how much of this Meta actually intended to do, but this genie is out of its bottle to great effect in the AI research community.
But GPT-5 will still cost $1+ billion and even ChatGPT, itself, is costing about $1 million per day just to run. That’s $300+ million per year to run old code.
So the current el cheapo AI research frenzy is likely to subside as LLaMA ages into obsolescence and has to be replaced by something more expensive, putting Google, Microsoft and OpenAI back in control. Understand, too, that these big, established companies like the idea of LLMs costing so much to build because that makes it harder for startups to disrupt. It’s a form of restraint of trade, though not illegal.
But before then — and even after then in certain vertical markets — there is a lot to learn and a lot of business to be done using these smaller models, which can be used to build true professional language models, which GPT-4 and ChatGPT definitely are not.
GPT-4 and ChatGPT are general purpose models — supposedly useful for pretty much anything. But that means that when you are asking ChatGPT for legal advice, for example, you are asking it to imitate a lawyer. While ChatGPT may be able to pass the bar test, so did my cousin Chad, who, I assure you, is an idiot.
If you are reading this I’ll bet you are smarter than your lawyer.
This means there is an opportunity for vertical LLMs trained on different data — real data from industries like medicine and auto mechanics. Whoever owns this data will own these markets.
What will make these models both better and cheaper is they can be built from a LLaMA base because most of that data doesn’t have to change over time to still fix your car, and the added Machine Learning won’t be from crap found on the Internet, but rather from the service manuals actually used to train mechanics and fix cars.
We are approaching a time when LLMs won’t have to imitate mechanics and nurses because they will be trained like mechanics and nurses.
Bloomberg has already done this for investment advice using its unique database of historical financial information.
With an average of 50 billion nodes, these vertical models will cost only five percent as much to run as OpenAI’s one-trillion-node GPT-4.
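For anyone who wants to check the five percent figure, here is a back-of-the-envelope sketch. It assumes, and this is only an assumption, that the cost to run a model scales linearly with its node count, and it takes GPT-4 at roughly one trillion nodes, which is a rumored number rather than a confirmed one. The function and constant names are made up for illustration.

```python
# Rough estimate only: ASSUMES running cost is proportional to node count.
def relative_run_cost(model_nodes: float, baseline_nodes: float) -> float:
    """Return the model's running cost as a fraction of the baseline's."""
    return model_nodes / baseline_nodes

VERTICAL_MODEL_NODES = 50e9   # 50 billion nodes, the average cited above
GPT4_NODES_RUMORED = 1e12     # about one trillion nodes (rumored, unconfirmed)

ratio = relative_run_cost(VERTICAL_MODEL_NODES, GPT4_NODES_RUMORED)
print(f"Vertical model costs about {ratio:.0%} as much to run")
```

Real running costs also depend on hardware, traffic, and how the model is served, so treat the ratio as a ballpark.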
But what does this have to do with semiconductors and Moore’s Law? Chip design is very similar to fixing cars in that there is a very limited amount of Machine Learning data required (think of logic cells as language words). It’s a small vocabulary (the auto repair section at the public library is just a few shelves of books). And EVEN BETTER THAN AUTO REPAIR, the semiconductor industry has well-developed simulation tools for testing logic before it is actually built.
So it ought to be pretty simple to apply AI to chip design, building custom chip design models to iterate into existing simulators and refine new designs that actually have a pretty good chance of being novel.
And who will be the first to leverage this chip AI? China.
The USA is doing its best to freeze China out of semiconductor development, denying access to advanced manufacturing tools, for example. But China is arguably the world’s #2 country for AI research and can use that advantage to make up some of the difference.
Look for fabless AI chip startups to spring-up around Chinese universities and for the Chinese Communist Party to put lots of money into this very cost-effective work. Because even if it’s used just to slim-down and improve existing designs, that’s another generation of chips China might otherwise not have had at all.
My son Cole, pictured here as a goofy kid many years ago, is now six feet six inches tall and in college. Cole needed a letter of recommendation recently so he turned to an old family friend who, in turn, used ChatGPT to generate the letter, which he thought was remarkably good. As a guy who pretends to write for a living, I read it differently. ChatGPT’s letter was facile but empty, the type of letter you would write for someone you’d never met. It said almost nothing about Cole other than that he’s a good kid. Artificial Intelligence is good for certain things, but blind letters of reference aren’t among them.
The key problem here has to do with Machine Learning. ChatGPT’s language model is nuanced, but contains no data at all specific to either my friend the lazy reference writer or my son the reference needer. Even if ChatGPT was allowed access to my old friend’s email boxes, it would only learn about his style and almost nothing about Cole, with whom he’s communicated, I think, twice.
If you think ChatGPT is the answer to some unmet personal need, it probably isn’t unless mediocrity is good enough or you are willing to share lots of private data — an option that I don’t think ChatGPT yet provides.
Then yesterday I learned a lesson from super-lawyer Neal Katyal who tweeted that he asked ChatGPT to write a specific 1000-word essay “in the style of Neal Katyal.” The result, he explained, was an essay that was largely wrong on the facts but read like he had written it.
What I learned from this was that there is a valuable business in writing prompts for Large Language Models like ChatGPT (many more are coming). I was stunned that it only required adding the words “in the style of Bob Cringely” to clone me. Until then I thought personalizing LLMs cost thousands, maybe millions (ChatGPT reportedly cost $2.25 million to train).
So where Google long ago trained us how to write queries, these Large Language Models will soon train us to write prompts to achieve our AI goals. In these cases we’re asking ChatGPT or Google’s Bard or Baidu’s Ernie or whatever LLM to temporarily forget about something, but that’s unlikely to give the LLMs better overall judgement.
Part of the problem with prompt-engineering is it is completely at the spell-casting / magical incantation phase: no one really understands the underlying general principles behind what makes a good prompt for getting a given kind of answer – work here is very preliminary and will probably vary greatly from LLM to LLM.
A logical solution to this problem might be to write a prompt that excludes unwanted information like racism while simultaneously including local data from your PC (called fine-tuning in the LLM biz), which would require API calls that to my knowledge haven’t yet been published. But once they are published, just imagine the new tools that could be created.
I believe there is a big opportunity to apply Artificial Intelligence to teaching, for example. While this also means applying AI to education in general, my desired path is through teachers, who I see as having been failed by educational IT, which makes their jobs harder, not easier. No wonder teachers hate IT.
The application of Information Technology to primary and secondary education has mainly involved scheduling and records. The master class schedule is in a computer. Grades are in another. And graduation requirements are handled by a database that spans the two, integrating attendance. Whether this is one vendor or up to four, the idea is generally to give the principal and school board daily snapshots of where everything stands. In this model the only place for teachers is data entry.
These systems require MORE teacher work, not less. And it leads to resentment and disappointment all around. It’s garbage-in, garbage-out as IT systems try to impose daily metrics on activities that were traditionally measured in weeks. I as a parent get mad when the system says my kid is failing when in fact it means someone forgot to upload grades or even forgot to grade work at all.
If report cards come out every six weeks it would be nice to know halfway through that my kid was struggling, but current systems we have been exposed to don’t do that. All they do is advertise in excruciating and useless detail that the system, itself, isn’t working right.
How could IT actually help teachers?
Look at Snorkel AI in Redwood City, CA for example. They are developing super-low-cost Machine Learning tools for Enterprise, not education, mainly because for education they can’t identify a customer.
I think the customer here is the teacher. This may sound odd, but understand that teachers aren’t well-served by IT to this point because they aren’t viewed as customers. They have no clout in the system. I chose to use the word clout rather than power or money because it better characterizes the teacher’s position as someone essential to the process but also both a source of thrust and drag.
I envision a new system where teachers can run their paperwork (both cellulose-based and electronic) through an AI that does a combination of automatically storing and classifying everything while also taking a first hack at grading. The AI comes to reflect mainly the values and methods of the individual teacher, which is new, and might keep more of them from quitting.
Generative AI is today’s buzziest form of artificial intelligence, and it’s what powers chatbots like ChatGPT, Ernie, LLaMA, Claude, and Command—as well as image generators like DALL-E 2, Stable Diffusion, Adobe Firefly, and Midjourney. Generative AI is the branch of AI that enables machines to learn patterns from vast datasets and then to autonomously produce new content based on those patterns. Although generative AI is fairly new, there are already many examples of models that can produce text, images, videos, and audio.
Many “foundation models” have been trained on enough data to be competent in a wide variety of tasks. For example, a large language model can generate essays, computer code, recipes, protein structures, jokes, medical diagnostic advice, and much more. It can also theoretically generate instructions for building a bomb or creating a bioweapon, though safeguards are supposed to prevent such types of misuse.
What’s the difference between AI, machine learning, and generative AI?
Artificial intelligence (AI) refers to a wide variety of computational approaches to mimicking human intelligence.
Machine learning (ML) is a subset of AI; it focuses on algorithms that enable systems to learn from data and improve their performance. Before generative AI came along, most ML models learned from datasets to perform tasks such as classification or prediction. Generative AI is a specialized type of ML involving models that perform the task of generating new content, venturing into the realm of creativity.
What architectures do generative AI models use?
Generative models are built using a variety of neural network architectures—essentially the design and structure that defines how the model is organized and how information flows through it. Some of the most well-known architectures are variational autoencoders (VAEs), generative adversarial networks (GANs), and transformers. It’s the transformer architecture, first shown in this seminal 2017 paper from Google, that powers today’s large language models. However, the transformer architecture is less suited for other types of generative AI, such as image and audio generation.
Autoencoders learn efficient representations of data through an encoder-decoder framework. The encoder compresses input data into a lower-dimensional space, known as the latent (or embedding) space, that preserves the most essential aspects of the data. A decoder can then use this compressed representation to reconstruct the original data. Once an autoencoder has been trained in this way, it can use novel inputs to generate what it considers the appropriate outputs. These models are often deployed in image-generation tools and have also found use in drug discovery, where they can be used to generate new molecules with desired properties.
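As a rough illustration of the encoder-decoder idea, here is a minimal sketch in plain Python. It is a toy linear autoencoder, far simpler than the VAEs used in real image or molecule generators. The data, starting weights, learning rate, and function names are all invented for this example.

```python
# Toy data: 2-D points that all lie on the line y = 2x, so a single number
# (the latent code) is enough to describe each point.
DATA = [(x, 2.0 * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

def encode(point, enc_w):
    """Compress a 2-D point into one latent number."""
    return enc_w[0] * point[0] + enc_w[1] * point[1]

def decode(z, dec_w):
    """Expand the latent number back into a 2-D point."""
    return (dec_w[0] * z, dec_w[1] * z)

def reconstruction_error(enc_w, dec_w):
    """Sum of squared differences between inputs and their reconstructions."""
    total = 0.0
    for p in DATA:
        r = decode(encode(p, enc_w), dec_w)
        total += (r[0] - p[0]) ** 2 + (r[1] - p[1]) ** 2
    return total

def train(epochs=300, lr=0.01):
    enc_w = [0.5, 0.1]  # arbitrary starting weights (hypothetical values)
    dec_w = [0.1, 0.5]
    for _ in range(epochs):
        for p in DATA:
            z = encode(p, enc_w)
            r = decode(z, dec_w)
            err = (r[0] - p[0], r[1] - p[1])
            # Gradients of the squared reconstruction error for this point.
            grad_dec = [2 * err[0] * z, 2 * err[1] * z]
            back = 2 * (err[0] * dec_w[0] + err[1] * dec_w[1])
            grad_enc = [back * p[0], back * p[1]]
            for i in range(2):
                dec_w[i] -= lr * grad_dec[i]
                enc_w[i] -= lr * grad_enc[i]
    return enc_w, dec_w

initial_error = reconstruction_error([0.5, 0.1], [0.1, 0.5])
enc_w, dec_w = train()
final_error = reconstruction_error(enc_w, dec_w)
print(initial_error, final_error)
```

The encoder squeezes two numbers into one and the decoder tries to undo that. Training nudges both sets of weights so the reconstruction error shrinks.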
With generative adversarial networks (GANs), the training involves a generator and a discriminator that can be considered adversaries. The generator strives to create realistic data, while the discriminator aims to distinguish between those generated outputs and real “ground truth” outputs. Every time the discriminator catches a generated output, the generator uses that feedback to try to improve the quality of its outputs. But the discriminator also receives feedback on its performance. This adversarial interplay results in the refinement of both components, leading to the generation of increasingly authentic-seeming content. GANs are best known for creating deepfakes but can also be used for more benign forms of image generation and many other applications.
The transformer is arguably the reigning champion of generative AI architectures for its ubiquity in today’s powerful large language models (LLMs). Its strength lies in its attention mechanism, which enables the model to focus on different parts of an input sequence while making predictions. In the case of language models, the input consists of strings of words that make up sentences, and the transformer predicts what words will come next (we’ll get into the details below). In addition, transformers can process all the elements of a sequence in parallel rather than marching through it from beginning to end, as earlier types of models did; this parallelization makes training faster and more efficient. When developers added vast datasets of text for transformer models to learn from, today’s remarkable chatbots emerged.
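To make the attention idea a little more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification: real transformers first pass each token through learned projection matrices to get separate query, key, and value vectors, and they run many attention "heads" at once. The three toy token vectors below are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """For each query, blend the values, weighted by query-key similarity."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-D vectors standing in for three tokens in a sentence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
print(weights[0])  # how much the first token attends to each token
```

Each query's row of weights is computed independently of the others. That independence is what lets real implementations process every position in the sequence at the same time instead of one after another.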
How do large language models work?
A transformer-based LLM is trained by giving it a vast dataset of text to learn from. The attention mechanism comes into play as it processes sentences and looks for patterns. By looking at all the words in a sentence at once, it gradually begins to understand which words are most commonly found together and which words are most important to the meaning of the sentence. It learns these things by trying to predict the next word in a sentence and comparing its guess to the ground truth. Its errors act as feedback signals that cause the model to adjust the weights it assigns to various words before it tries again.
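The guess, compare, and adjust loop described above can be sketched in a few lines of Python. This is not a transformer: it is a toy model that keeps a table of scores for "which word follows which," and the corpus, learning rate, and function names are invented for illustration. What it shares with a real LLM is the training signal, namely the gap between the predicted next word and the actual one.

```python
import math

CORPUS = "the cat sat on the mat the cat ate".split()
VOCAB = sorted(set(CORPUS))
INDEX = {word: i for i, word in enumerate(VOCAB)}

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train(epochs=200, lr=0.1):
    n = len(VOCAB)
    # weights[i][j] is the score for word j following word i.
    weights = [[0.0] * n for _ in range(n)]
    for _ in range(epochs):
        for prev, actual in zip(CORPUS, CORPUS[1:]):
            row = weights[INDEX[prev]]
            probs = softmax(row)  # the model's guess
            for j in range(n):
                target = 1.0 if j == INDEX[actual] else 0.0
                # The error (guess minus ground truth) is the feedback signal.
                row[j] -= lr * (probs[j] - target)
    return weights

def predict_next(weights, word):
    probs = softmax(weights[INDEX[word]])
    return VOCAB[probs.index(max(probs))]

weights = train()
print(predict_next(weights, "the"))
```

In this tiny corpus "the" is followed by "cat" twice and "mat" once, so the trained table ends up favoring "cat" after "the".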
[Figure: These five LLMs vary greatly in size (given in parameters), and the larger models have better performance on a standard LLM benchmark test. Credit: IEEE Spectrum]
To explain the training process in slightly more technical terms, the text in the training data is broken down into elements called tokens, which are words or pieces of words—but for simplicity’s sake, let’s say all tokens are words. As the model goes through the sentences in its training data and learns the relationships between tokens, it creates a list of numbers, called a vector, for each one. All the numbers in the vector represent various aspects of the word: its semantic meanings, its relationship to other words, its frequency of use, and so on. Similar words, like elegant and fancy, will have similar vectors and will also be near each other in the vector space. These vectors are called word embeddings. The parameters of an LLM include the weights associated with all the word embeddings and the attention mechanism. GPT-4, the OpenAI model that’s considered the current champion, is rumored to have more than 1 trillion parameters.
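One common way to measure whether two word vectors are "near each other" is cosine similarity. The sketch below uses tiny three-number vectors that were made up by hand for this example. Real embeddings are learned during training and have hundreds or thousands of dimensions.

```python
import math

# Hypothetical, hand-written embeddings (NOT taken from any real model).
EMBEDDINGS = {
    "elegant": [0.9, 0.8, 0.1],
    "fancy":   [0.8, 0.9, 0.2],
    "wrench":  [0.1, 0.0, 0.9],
}

def cosine_similarity(a, b):
    """Close to 1.0 when two vectors point the same way, near 0 when unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

similar = cosine_similarity(EMBEDDINGS["elegant"], EMBEDDINGS["fancy"])
dissimilar = cosine_similarity(EMBEDDINGS["elegant"], EMBEDDINGS["wrench"])
print(f"elegant vs fancy:  {similar:.2f}")
print(f"elegant vs wrench: {dissimilar:.2f}")
```

With these invented vectors, "elegant" scores much closer to "fancy" than to "wrench," which mirrors how related words cluster together in a trained model's vector space.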
Given enough data and training time, the LLM begins to understand the subtleties of language. While much of the training involves looking at text sentence by sentence, the attention mechanism also captures relationships between words throughout a longer text sequence of many paragraphs. Once an LLM is trained and is ready for use, the attention mechanism is still in play. When the model is generating text in response to a prompt, it’s using its predictive powers to decide what the next word should be. When generating longer pieces of text, it predicts the next word in the context of all the words it has written so far; this function increases the coherence and continuity of its writing.
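The word-by-word generation loop can be sketched as follows. The "model" here is a simple lookup table built from word counts, and it only looks at the last two words, which is a crude stand-in for an LLM's attention over everything written so far. The sample text and function names are invented for illustration.

```python
from collections import Counter, defaultdict

TEXT = "the cat sat on the mat and the cat sat on the sofa".split()

def build_model(words, context_size=2):
    """Count which word follows each 1-word and 2-word context."""
    model = defaultdict(Counter)
    for i in range(len(words)):
        for n in range(1, context_size + 1):
            if i - n >= 0:
                model[tuple(words[i - n:i])][words[i]] += 1
    return model

def predict_next(model, context, context_size=2):
    """Pick the most frequent follower, preferring the longest known context."""
    for n in range(context_size, 0, -1):
        key = tuple(context[-n:])
        if len(key) == n and key in model:
            return model[key].most_common(1)[0][0]
    return None

def generate(model, prompt, max_new_words=3):
    words = list(prompt)
    for _ in range(max_new_words):
        # Each new word is predicted from the words produced so far,
        # then appended so it becomes context for the next prediction.
        next_word = predict_next(model, words)
        if next_word is None:
            break
        words.append(next_word)
    return words

model = build_model(TEXT)
print(" ".join(generate(model, ["the", "cat"])))  # prints "the cat sat on the"
```

Feeding each output back in as input is what makes generation "autoregressive." It is also why an early odd word choice can steer everything that follows.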
Why do large language models hallucinate?
You may have heard that LLMs sometimes “hallucinate.” That’s a polite way to say they make stuff up very convincingly. A model sometimes generates text that fits the context and is grammatically correct, yet the material is erroneous or nonsensical. This bad habit stems from LLMs training on vast troves of data drawn from the Internet, plenty of which is not factually accurate. Since the model is simply trying to predict the next word in a sequence based on what it has seen, it may generate plausible-sounding text that has no grounding in reality.
Why is generative AI controversial?
One source of controversy for generative AI is the provenance of its training data. Most AI companies that train large models to generate text, images, video, and audio have not been transparent about the content of their training datasets. Various leaks and experiments have revealed that those datasets include copyrighted material such as books, newspaper articles, and movies. A number of lawsuits are underway to determine whether use of copyrighted material for training AI systems constitutes fair use, or whether the AI companies need to pay the copyright holders for use of their material.
On a related note, many people are concerned that the widespread use of generative AI will take jobs away from creative humans who make art, music, written works, and so forth. People are also concerned that it could take jobs from humans who do a wide range of white-collar jobs, including translators, paralegals, customer-service representatives, and journalists. There have already been a few troubling layoffs, but it’s hard to say yet whether generative AI will be reliable enough for large-scale enterprise applications. (See above about hallucinations.)
Finally, there’s the danger that generative AI will be used to make bad stuff. And there are of course many categories of bad stuff it could theoretically be used for. Generative AI can be used for personalized scams and phishing attacks: For example, using “voice cloning,” scammers can copy the voice of a specific person and call the person’s family with a plea for help (and money). All formats of generative AI—text, audio, image, and video—can be used to generate misinformation by creating plausible-seeming representations of things that never happened, which is a particularly worrying possibility when it comes to elections. (Meanwhile, as IEEE Spectrum reported this week, the U.S. Federal Communications Commission has responded by outlawing AI-generated robocalls.) Image- and video-generating tools can be used to produce nonconsensual pornography, although the tools made by mainstream companies disallow such use. And chatbots can theoretically walk a would-be terrorist through the steps of making a bomb, nerve gas, and a host of other horrors. Although the big LLMs have safeguards to prevent such misuse, some hackers delight in circumventing those safeguards. What’s more, “uncensored” versions of open-source LLMs are out there.
Despite such potential problems, many people think that generative AI can also make people more productive and could be used as a tool to enable entirely new forms of creativity. We’ll likely see both disasters and creative flowerings and plenty else that we don’t expect. But knowing the basics of how these models work is increasingly crucial for tech-savvy people today. Because no matter how sophisticated these systems grow, it’s the humans’ job to keep them running, make the next ones better, and with any luck, help people out too.
On Friday, Bloomberg reported that Reddit has signed a contract allowing an unnamed AI company to train its models on the site's content, according to people familiar with the matter. The move comes as the social media platform nears the introduction of its initial public offering (IPO), which could happen as soon as next month.
Reddit initially revealed the deal, which is reported to be worth $60 million a year, earlier in 2024 to potential investors in its anticipated IPO, Bloomberg said. The Bloomberg source speculates that the contract could serve as a model for future agreements with other AI companies.
After an era where AI companies utilized AI training data without expressly seeking any rightsholder permission, some tech firms have more recently begun entering deals where some content used for training AI models similar to GPT-4 (which runs the paid version of ChatGPT) comes under license. In December, for example, OpenAI signed an agreement with German publisher Axel Springer (publisher of Politico and Business Insider) for access to its articles. Previously, OpenAI has struck deals with other organizations, including the Associated Press. Reportedly, OpenAI is also in licensing talks with CNN, Fox, and Time, among others.