Last month when Google introduced its new AI search tool, called AI Overviews, the company seemed confident that it had tested the tool sufficiently, noting in the announcement that “people have already used AI Overviews billions of times through our experiment in Search Labs.” The tool doesn’t just return links to Web pages, as in a typical Google search, but returns an answer that it has generated based on various sources, which it links to below the answer. But immediately after the launch, users began posting examples of extremely wrong answers, including a pizza recipe that called for glue and the claim that a dog has played in the NBA.
While the pizza recipe is unlikely to convince anyone to squeeze on the Elmer’s, not all of AI Overview’s extremely wrong answers are so obvious—and some have the potential to be quite harmful. Renée DiResta has been tracking online misinformation for many years as the technical research manager at Stanford’s Internet Observatory and has a new book out about the online propagandists who “turn lies into reality.” She has studied the spread of medical misinformation via social media, so IEEE Spectrum spoke to her about whether AI search is likely to bring an onslaught of erroneous medical advice to unwary users.
I know you’ve been tracking disinformation on the Web for many years. Do you expect the introduction of AI-augmented search tools like Google’s AI Overviews to make the situation worse or better?
Renée DiResta: It’s a really interesting question. There are a couple of policies that Google has had in place for a long time that appear to be in tension with what’s coming out of AI-generated search. That’s made me feel like part of this is Google trying to keep up with where the market has gone. There’s been an incredible acceleration in the release of generative AI tools, and we are seeing Big Tech incumbents trying to make sure that they stay competitive. I think that’s one of the things that’s happening here.
We have long known that hallucinations are a thing that happens with large language models. That’s not new. It’s the deployment of them in a search capacity that I think has been rushed and ill-considered because people expect search engines to give them authoritative information. That’s the expectation you have on search, whereas you might not have that expectation on social media.
There are plenty of examples of comically poor results from AI search, things like how many rocks we should eat per day [a response that was drawn from an Onion article]. But I’m wondering if we should be worried about more serious medical misinformation. I came across one blog post about Google’s AI Overviews responses about stem-cell treatments. The problem there seemed to be that the AI search tool was sourcing its answers from disreputable clinics that were offering unproven treatments. Have you seen other examples of that kind of thing?
DiResta: I have. It’s returning information synthesized from the data that it’s trained on. The problem is that it does not seem to be adhering to the same standards that have long gone into how Google thinks about returning search results for health information. So what I mean by that is Google has, for upwards of 10 years at this point, had a search policy called Your Money or Your Life. Are you familiar with that?
I don’t think so.
DiResta: Your Money or Your Life acknowledges that for queries related to finance and health, Google has a responsibility to hold search results to a very high standard of care, and it’s paramount to get the information correct. People are coming to Google with sensitive questions and they’re looking for information to make materially impactful decisions about their lives. They’re not there for entertainment when they’re asking a question about how to respond to a new cancer diagnosis, for example, or what sort of retirement plan they should be subscribing to. So you don’t want content farms and random Reddit posts and garbage to be the results that are returned. You want to have reputable search results.
That framework of Your Money or Your Life has informed Google’s work on these high-stakes topics for quite some time. And that’s why I think it’s disturbing for people to see the AI-generated search results regurgitating clearly wrong health information from low-quality sites that perhaps happened to be in the training data.
So it seems like AI Overviews is not following that same policy—or at least that’s how it appears from the outside?
DiResta: That’s how it appears from the outside. I don’t know how they’re thinking about it internally. But those screenshots you’re seeing—a lot of these instances are being traced back to an isolated social media post or a clinic that’s disreputable but exists—are out there on the Internet. It’s not simply making things up. But it’s also not returning what we would consider to be a high-quality result in formulating its response.
I saw that Google responded to some of the problems with a blog post saying that it is aware of these poor results and it’s trying to make improvements. And I can read you the one bullet point that addressed health. It said, “For topics like news and health, we already have strong guardrails in place. In the case of health, we launched additional triggering refinements to enhance our quality protections.” Do you know what that means?
DiResta: That blog post is an explanation that [AI Overviews] isn’t simply hallucinating—the fact that it’s pointing to URLs is supposed to be a guardrail because that enables the user to go and follow the result to its source. This is a good thing. They should be including those sources for transparency and so that outsiders can review them. However, it is also a fair bit of onus to put on the audience, given the trust that Google has built up over time by returning high-quality results in its health information search rankings.
I know one topic that you’ve tracked over the years has been disinformation about vaccine safety. Have you seen any evidence of that kind of disinformation making its way into AI search?
DiResta: I haven’t, though I imagine outside research teams are now testing results to see what appears. Vaccines have been so much a focus of the conversation around health misinformation for quite some time that I imagine Google has had people looking specifically at that topic in internal reviews, whereas some of these other topics might be less in the forefront of the minds of the quality teams that are tasked with checking if there are bad results being returned.
What do you think Google’s next moves should be to prevent medical misinformation in AI search?
DiResta: Google has a perfectly good policy to pursue. Your Money or Your Life is a solid ethical guideline to incorporate into this manifestation of the future of search. So it’s not that I think there’s a new and novel ethical grounding that needs to happen. I think it’s more ensuring that the ethical grounding that exists remains foundational to the new AI search tools.
Stephen Cass: Hello. I’m Stephen Cass, Special Projects Director at IEEE Spectrum. Before starting today’s episode hosted by Eliza Strickland, I wanted to give you all listening out there some news about this show.
This is our last episode of Fixing the Future. We’ve really enjoyed bringing you some concrete solutions to some of the world’s toughest problems, but we’ve decided we’d like to be able to go deeper into topics than we can in the course of a single episode. So we’ll be returning later in the year with a program of limited series that will enable us to do those deep dives into fascinating and challenging stories in the world of technology. I want to thank you all for listening and I hope you’ll join us again. And now, on to today’s episode.
Eliza Strickland: Hi, I’m Eliza Strickland for IEEE Spectrum’s Fixing the Future podcast. Before we start, I want to tell you that you can get the latest coverage from some of Spectrum’s most important beats, including AI, climate change, and robotics, by signing up for one of our free newsletters. Just go to spectrum.IEEE.org/newsletters to subscribe.
Around the world, about 60 countries are contaminated with land mines and unexploded ordnance, and Ukraine is the worst off. Today, about a third of its land, an area the size of Florida, is estimated to be contaminated with dangerous explosives. My guest today is Gabriel Steinberg, who co-founded both the nonprofit Demining Research Community and the startup Safe Pro AI with his friend, Jasper Baur. Their technology uses drones and artificial intelligence to radically speed up the process of finding land mines and other explosives. Okay, Gabriel, thank you so much for joining me on Fixing the Future today.
Gabriel Steinberg: Yeah, thank you for having me.
Strickland: So I want to start by hearing about the typical process for demining, and so the standard operating procedure. What tools do people use? How long does it take? What are the risks involved? All that kind of stuff.
Steinberg: Sure. So humanitarian demining hasn’t changed significantly. There have been evolutions, of course, since its inception around the end of World War I. But mostly, the processes have been the same. People work from a safe location and walk around an area in places that they know are safe, and try to get as much intelligence about the contamination as they can. They ask villagers or farmers, people who work and live around the area, about accidents and potential sightings of minefields and former battle positions. The result of this is a very general idea, a polygon, of where the contamination is. That first part is the non-technical survey. After that polygon is drawn and some prioritization is done based on danger to civilians and economic utility, the field goes into clearance. Clearance happens one of three ways, usually, but it always ends up with a person on the ground basically doing extreme gardening. They dig out a certain standard amount of the soil, usually 13 centimeters, and they walk the field with a metal detector and a mine probe. They find the land mines and unexploded ordnance. So that’s always how it ends.
To get to that point, you can also use mechanical assets, which are large tillers, and sometimes dogs and other animals are used to walk in lanes across the contaminated polygon to sniff out the land mines and tell the clearance operators where the land mines are.
Strickland: How do you hope that your technology will change this process?
Steinberg: Well, my technology is a drone-based mapping solution, basically. So we provide software to the humanitarian deminers. They are already flying drones over these areas. Really, it started ramping up in Ukraine. The humanitarian demining organizations have started really adopting drones just because it’s such a massive problem. The extent is so extreme that they need to innovate. So we provide AI and mapping software for the deminers to analyze their drone imagery much more effectively. We hope that our software will decrease the amount of time that deminers use to analyze the imagery of the land, thereby more quickly and more effectively constraining the areas with the most contamination. So if you can constrain an area, a polygon with a certainty of contamination and a high density of contamination, then you can deploy the most expensive parts of the clearance process, which are the humans and the machines and the dogs. You can deploy them to a very specific area. You can much more cost-effectively and efficiently demine large areas.
Strickland: Got it. So it doesn’t replace the humans walking around with metal detectors and dogs, but it gets them to the right spots faster.
Steinberg: Exactly. Exactly. At the moment, there is no conception of replacing a human in demining operations, and people that try to push that eventuality are usually disregarded pretty quickly.
Strickland: How did you and your co-founder, Jasper, first start experimenting with the use of drones and AI for detecting explosives?
Steinberg: So it started in 2016 with my partner, Jasper Baur, doing a research project at Binghamton University in the remote sensing and geophysics lab. And the project was to detect a specific anti-personnel land mine, the PFM-1. It’s a Russian-made land mine. It was previously found in Afghanistan. It still is found in Afghanistan, but it’s found in much higher quantities right now in Ukraine. And so his project was to detect the PFM-1 anti-personnel land mine using thermal imagery from drones. It sort of snowballed into quite an intensive research project. It had multiple papers from it, multiple researchers, some awards, and most notably, it beat NASA at a particular Tech Briefs competition. So that was quite a morale boost.
And at some point, Jasper had the idea to integrate AI into the project. Rightfully, he saw the real bottleneck as not the detecting of land mines in drone imagery, but the analysis of land mines in drone imagery. He knew, somehow, that that would really become the issue that everybody is facing. And everybody we talked to in Ukraine is facing that issue. So machine learning really was the key for solving that problem. And I joined the project in 2018 to integrate machine learning into the research project. We had some more papers, some more presentations, and we were nearing the end of our college tenure, of our undergraduate degree, in 2020. At that time, we realized how much the field needed this. We started getting more and more into the mine action field, and realizing how neglected the field was in terms of technology and innovation. And we felt an obligation to bring our technology, really, to the real world instead of just a research project. There were plenty of research projects about this, but we knew that it could be more and that it should be more. And for some reason, we felt like we had the capability to make that happen.
So we formed a nonprofit, the Demining Research Community, in 2020 to try to raise some funding for this project. The for-profit end of our endeavors was acquired by a company called Safe Pro Group in 2023, about one year ago exactly. And the drone and AI technology became Safe Pro AI and our flagship product, Spotlight AI. And that’s where we’re bringing the technology to the real world. The Demining Research Community is providing resources for other organizations who want to do a similar thing, and is doing more research into more nascent technologies. But yeah, the real drone and AI stuff that’s happening in the real world right now is through Safe Pro.
Strickland: So in that early undergraduate work, you were using thermal sensors. I know the Spotlight AI system now relies more on visual imagery. Can you talk about the different modalities of sensing explosives and the sort of trade-offs you get with them?
Steinberg: Sure. So I feel like I should preface this by saying the more high tech and nascent the technology is, the more people want to see it apply to land mine detection. But really, we have found from the problems that people are facing, by far the most effective modality right now is just visual imagery. People have really good visual sensors built into their face, and you don’t need a trained geophysicist to observe the data and very, very quickly get actionable intelligence. There’s also plenty of other benefits. It’s cheaper, much more readily accessible in Ukraine and around the world to get built-in visual sensors on drones. And yeah, just processing the data, and getting the intelligence from the data, is way easier than anything else.
I’ll talk about three different modalities. Well, I guess I could talk about four. There’s thermal, ground penetrating radar, magnetometry, and lidar. So thermal is what we started with. Thermal is really good at detecting living things, as I’m sure most people can surmise. But it’s also pretty good at detecting land mines, mostly large anti-tank land mines buried under a couple millimeters, or up to a couple centimeters, of soil. It’s not super good at this. The research is still not super conclusive, and you have to do it at a very specific time of day, in the morning and at night, when, basically, the soil around the land mine heats up faster than the land mine and you get a thermal anomaly, or the sun causes a thermal anomaly. So it can detect things, land mines, in some amount of depth in certain soils, in certain weather conditions, and can only detect certain types of land mines that are big and hefty enough. So yeah, that’s thermal.
Ground penetrating radar is really good for some things. It’s not really great for land mine detection. You have to have really expensive equipment. It takes a really long time to do the surveys. However, it can get plastic land mines under the surface. And it’s kind of the only modality that can do that with reliability. However, you need to train geophysicists to analyze the data. And a lot of the time, the signatures are really non-unique and there are going to be a lot of false positives. Magnetometry is the other one. By the way, all of this I’m referring to is airborne. Ground-based GPR and magnetometry are used in demining of various types, but airborne is really what I’m talking about.
For magnetometry, it’s more developed and more capable than ground penetrating radar. It’s used, actually, in the field in Ukraine in some scenarios, but it’s still very expensive. It needs a trained geophysicist to analyze the data, and the signatures are non-unique. So whether it’s a bottle cap or a small anti-personnel land mine, you really don’t know until you dig it up. However, I think if I were to bet on one of the other modalities becoming increasingly useful in the next couple of years, it would be airborne magnetometry.
Lidar is another modality that people use. It’s pretty quick, also very expensive, but it can reliably map and find surface anomalies. So if you want to find former fighting positions, sometimes an indicator of that is a trench line or foxholes. Lidar is really good at doing that in conflicts from long ago. So there’s a paper that the HALO Trust published about flying a lidar mission over former fighting positions, I believe, in Angola. And they reliably found a former trench line. And from that information, they confirmed that as a hazardous area. Because if there is a former front line on this position, you can pretty reliably say that there are going to be some explosives there.
Strickland: And so you’ve done some experiments with some of these modalities, but in the end, you found that the visual sensor was really the best bet for you guys?
Steinberg: Yeah. It’s different. The requirements are different for different scenarios and different locations, really. Ukraine has a lot of surface ordnance. Yeah. And that’s really the main factor that allows visual imagery to be so powerful.
Strickland: So tell me about what role machine learning plays in your Spotlight AI software system. Did you create a model trained on a lot of data showing land mines on the surface?
Steinberg: Yeah. Exactly. We used real-world data from inert, non-explosive items, and flew drone missions over them, and did some physical augmentation and some programmatic augmentation. But all of the items that we are training on are real-life Russian or American ordnance, mostly. We’re also using the real-world data in real minefields that we’re getting from Ukraine right now. That is, obviously, the most valuable data and the most effective in building a machine learning model. But yeah, a lot of our data is from inert explosives, as well.
Strickland: So you’ve talked a little bit about the current situation in Ukraine, but can you tell me more about what people are dealing with there? Are there a lot of areas where the battle has moved on and civilians are trying to reclaim roads or fields?
Steinberg: Yeah. So the fighting is constantly ongoing, obviously, in eastern Ukraine, but I think sometimes there’s a perspective of a stalemate. I think that’s a little misleading. There’s lots of action and violence happening on the front line, which constantly contaminates, cumulatively, the areas that are the front line and the gray zone, as well as areas up to 50 kilometers back from both sides. So there’s constantly artillery shells going into villages and cities along the front line. There’s constantly land mines, new mines, being laid to reinforce the positions. And there’s constantly mortars. And everything is constant. In some fights—I just watched the video yesterday—one of the soldiers said you could not count to five without an explosion going off. And this is just one location in one city along the front. So you can imagine the amount of explosive ordnance that is being fired, and inevitably 10, 20, 30 percent of it sometimes doesn’t explode upon impact, on top of all the land mines that are being purposely laid and not detonating from a vehicle or a person. These all just remain after the war. They don’t go anywhere. So yeah, Ukraine is really being littered with explosive ordnance and land mines every day.
This past year, there hasn’t been terribly much movement on the front line. But in the last major Ukrainian counteroffensive, when areas of Mykolaiv, which is in the southeast, were reclaimed, the civilians started repopulating the city almost immediately. There are definitely some villages that are heavily contaminated, that people just deserted and never came back to, and still haven’t come back to after they were liberated. But a lot of the areas that have been liberated, they’re people’s homes. And even if they’re destroyed, people would rather be in their homes than be refugees. And I mean, I totally understand that. And it just puts the responsibility on the deminers and the Ukrainian government to try to clear the land as fast as possible. Because after large liberations are made, people want to come back almost all the time. So it is a very urgent problem as the lines change and as land is liberated.
Strickland: And I think it was about a year ago that you and Jasper went to Ukraine for a technology demonstration set up by the United Nations. Can you tell me about that, and what the task was, and how your technology fared?
Steinberg: Sure. So yeah, the United Nations Development Program invited us to do a demonstration in northern Ukraine to see how our technology, and other technologies similar to it, performed in a military training facility in Ukraine. So everybody who’s doing this kind of thing, which is not many people, but there are some other organizations, they have their own metrics and their own test fields— not always, but it would be good if they did. But the UNDP said, “No, we want to standardize this and try to give recommendations to the organizations on the ground who are trying to adopt these technologies.” So we had five hours to survey the field and collect as much data as we could. And then we had 72 hours to return the results. We—
Strickland: Sorry. How big was the field?
Steinberg: The field was 25 hectares. So yeah, the audience at home can convert 25 hectares to football fields. I think it’s about 60. But it’s a large area. So we’d never done anything like that. That was really, really a shock that it was that large of an area. I think we’d only done half a hectare at a time up to that point. So yeah, it was pretty daunting. But we basically slept very, very little in those 72 hours, and as a result, produced what I think is one of the best results that the UNDP got from that test. We didn’t detect everything, but we detected most of the ordnance and land mines that they had laid. We also detected some that they didn’t know were there because it was a military training facility. So there were some mortars being fired that they didn’t know about.
Strickland: And I think Jasper told me that you had to sort of rewrite your software on the fly. You realized that the existing approach wasn’t going to work and you had to do some all-nighter to recode?
Steinberg: Yeah. Yeah, I remember us sitting in a Georgian restaurant— Georgia, the country, not the state, and racking our brain, trying to figure out how we were going to map this amount of land. We just found out how big the area was going to be and we were a little bit stunned. So we devised a plan to do it in two stages. The first stage was where we figured out in the drone images where the contaminated regions were. And then the second stage was to map those areas, just those areas. Now, our software can actually map the whole thing, and pretty casually too. So not to brag. But at the time, we had lots less development under our belt. And yeah, therefore we just had to brute force it through Georgian food and brainpower.
Strickland: You and Jasper just got back from another trip to Ukraine a couple of weeks ago, I think. Can you talk about what you were doing on this trip, and who you met with?
Steinberg: Sure. This trip was much less stressful, although stressful in different ways than the UNDP demo. Our main objectives were to see operations in action. We had never actually been to real minefields before. We’d been in some perhaps contaminated areas, but never in a real minefield where you can say, “Here was the Russian position. There are the land mines. Do not go there.” So that was one of the main objectives. That was very powerful for us to see the villages that were destroyed and are denied to the citizens because of land mines and unexploded ordnance. It’s impossible to describe how that feels being there. It’s really impactful, and it makes the work that I’m doing feel not like I have a choice anymore. I feel very much obligated to do my absolute best to help these people.
Strickland: Well, I hope your work continues. I hope there’s less and less need for it over time. But yeah, thank you for doing this. It’s important work. And thanks for joining me on Fixing the Future.
Steinberg: My pleasure. Thank you for having me.
Strickland: That was Gabriel Steinberg speaking to me about the technology that he and Jasper Baur developed to help rid the world of land mines. I’m Eliza Strickland, and I hope you’ll join us next time on Fixing the Future.
Theory of mind—the ability to understand other people’s mental states—is what makes the social world of humans go around. It’s what helps you decide what to say in a tense situation, guess what drivers in other cars are about to do, and empathize with a character in a movie. And according to a new study, the large language models (LLMs) that power ChatGPT and the like are surprisingly good at mimicking this quintessentially human trait.
“Before running the study, we were all convinced that large language models would not pass these tests, especially tests that evaluate subtle abilities to evaluate mental states,” says study coauthor Cristina Becchio, a professor of cognitive neuroscience at the University Medical Center Hamburg-Eppendorf in Germany. The results, which she calls “unexpected and surprising,” were published today—somewhat ironically, in the journal Nature Human Behaviour.
The results don’t have everyone convinced that we’ve entered a new era of machines that think like we do, however. Two experts who reviewed the findings advised taking them “with a grain of salt” and cautioned about drawing conclusions on a topic that can create “hype and panic in the public.” Another outside expert warned of the dangers of anthropomorphizing software programs.
Becchio and her colleagues aren’t the first to claim evidence that LLMs’ responses display this kind of reasoning. In a preprint paper posted last year, the psychologist Michal Kosinski of Stanford University reported testing several models on a few common theory-of-mind tests. He found that the best of them, OpenAI’s GPT-4, solved 75 percent of tasks correctly, which he said matched the performance of six-year-old children observed in past studies. However, that study’s methods were criticized by other researchers who conducted follow-up experiments and concluded that the LLMs were often getting the right answers based on “shallow heuristics” and shortcuts rather than true theory-of-mind reasoning.
The authors of the present study were well aware of the debate. “Our goal in the paper was to approach the challenge of evaluating machine theory of mind in a more systematic way using a breadth of psychological tests,” says study coauthor James Strachan, a cognitive psychologist who’s currently a visiting scientist at the University Medical Center Hamburg-Eppendorf. He notes that doing a rigorous study meant also testing humans on the same tasks that were given to the LLMs: The study compared the abilities of 1,907 humans with those of several popular LLMs, including OpenAI’s GPT-4 model and the open-source Llama 2-70b model from Meta.
How to Test LLMs for Theory of Mind
The LLMs and the humans both completed five typical kinds of theory-of-mind tasks, the first three of which were understanding hints, irony, and faux pas. They also answered “false belief” questions that are often used to determine if young children have developed theory of mind, and go something like this: If Alice moves something while Bob is out of the room, where will Bob look for it when he returns? Finally, they answered rather complex questions about “strange stories” that feature people lying, manipulating, and misunderstanding each other.
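The paper doesn’t reproduce its prompts verbatim, but to make the setup concrete, here is a minimal sketch of how a false-belief item might be posed to a chat model through an API. The wording, the model name, and the single-question format are illustrative assumptions, not the study’s actual protocol.

```python
# Minimal sketch of posing a false-belief question to a chat model.
# The prompt wording and model name are illustrative; the study's actual
# test items and evaluation pipeline are not reproduced here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

false_belief_item = (
    "Alice puts her chocolate in the drawer and leaves the room. "
    "While she is away, Bob moves the chocolate to the cupboard. "
    "When Alice returns, where will she look for her chocolate?"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": false_belief_item}],
    temperature=0,  # deterministic answers make scoring easier
)
print(response.choices[0].message.content)
```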
Overall, GPT-4 came out on top. Its scores matched those of humans for the false-belief test, and were higher than the aggregate human scores for irony, hinting, and strange stories; it performed worse than humans only on the faux pas test. Interestingly, Llama-2’s scores were the opposite of GPT-4’s—it matched humans on false belief, but had worse-than-human performance on irony, hinting, and strange stories and better performance on faux pas.
To understand what was going on with the faux pas results, the researchers gave the models a series of follow-up tests that probed several hypotheses. They came to the conclusion that GPT-4 was capable of giving the correct answer to a question about a faux pas, but was held back from doing so by “hyperconservative” programming regarding opinionated statements. Strachan notes that OpenAI has placed many guardrails around its models that are “designed to keep the model factual, honest, and on track,” and he posits that strategies intended to keep GPT-4 from hallucinating (that is, making stuff up) may also prevent it from opining on whether a story character inadvertently insulted an old high school classmate at a reunion.
Meanwhile, the researchers’ follow-up tests for Llama-2 suggested that its excellent performance on the faux pas tests was likely an artifact of the original question-and-answer format, in which the correct answer to some variant of the question “Did Alice know that she was insulting Bob?” was always “No.”
The researchers are careful not to say that their results show that LLMs actually possess theory of mind, and say instead that they “exhibit behavior that is indistinguishable from human behavior in theory of mind tasks.” Which raises the question: If an imitation is as good as the real thing, how do you know it’s not the real thing? That’s a question social scientists have never tried to answer before, says Strachan, because tests on humans assume that the quality exists to some lesser or greater degree. “We don’t currently have a method or even an idea of how to test for the existence of theory of mind, the phenomenological quality,” he says.
Critiques of the Study
The researchers clearly tried to avoid the methodological problems that caused Kosinski’s 2023 paper on LLMs and theory of mind to come under criticism. For example, they conducted the tests over multiple sessions so the LLMs couldn’t “learn” the correct answers during the test, and they varied the structure of the questions. But Yoav Goldberg and Natalie Shapira, two of the AI researchers who published the critique of the Kosinski paper, say they’re not convinced by this study either.
Goldberg made the comment about taking the findings with a grain of salt, adding that “models are not human beings,” and that “one can easily jump to wrong conclusions” when comparing the two. Shapira spoke about the dangers of hype, and also questions the paper’s methods. She wonders if the models might have seen the test questions in their training data and simply memorized the correct answers, and also notes a potential problem with tests that use paid human participants (in this case, recruited via the Prolific platform). “It is a well-known issue that the workers do not always perform the task optimally,” she tells IEEE Spectrum. She considers the findings limited and somewhat anecdotal, saying, “to prove [theory of mind] capability, a lot of work and more comprehensive benchmarking is needed.”
Emily Bender, a professor of computational linguistics at the University of Washington, has become legendary in the field for her insistence on puncturing the hype that inflates the AI industry (and often also the media reports about that industry). She takes issue with the research question that motivated the researchers. “Why does it matter whether text-manipulation systems can produce output for these tasks that are similar to answers that people give when faced with the same questions?” she asks. “What does that teach us about the internal workings of LLMs, what they might be useful for, or what dangers they might pose?” It’s not clear, Bender says, what it would mean for a LLM to have a model of mind, and it’s therefore also unclear if these tests measured for it.
Bender also raises concerns about the anthropomorphizing she spots in the paper, with the researchers saying that the LLMs are capable of cognition, reasoning, and making choices. She says the authors’ phrase “species-fair comparison between LLMs and human participants” is “entirely inappropriate in reference to software.” Bender and several colleagues recently posted a preprint paper exploring how anthropomorphizing AI systems affects users’ trust.
The results may not indicate that AI really gets us, but it’s worth thinking about the repercussions of LLMs that convincingly mimic theory-of-mind reasoning. They’ll be better at interacting with their human users and anticipating their needs, but they could also become better tools for deceiving or manipulating those users. And they’ll invite more anthropomorphizing, by convincing human users that there’s a mind on the other side of the user interface.
Each year, the AI Index lands on virtual desks with a louder virtual thud—this year, its 393 pages are a testament to the fact that AI is coming off a really big year in 2023. For the past three years, IEEE Spectrum has read the whole damn thing and pulled out a selection of charts that sum up the current state of AI (see our coverage from 2021, 2022, and 2023).
This year’s report, published by the Stanford Institute for Human-Centered Artificial Intelligence (HAI), has an expanded chapter on responsible AI and new chapters on AI in science and medicine, as well as its usual roundups of R&D, technical performance, the economy, education, policy and governance, diversity, and public opinion. This year is also the first time that Spectrum has figured into the report, with a citation of an article published here about generative AI’s visual plagiarism problem.
1. Generative AI investment skyrockets
While corporate investment was down overall last year, investment in generative AI went through the roof. Nestor Maslej, editor-in-chief of this year’s report, tells Spectrum that the boom is indicative of a broader trend in 2023, as the world grappled with the new capabilities and risks of generative AI systems like ChatGPT and the image-generating DALL-E 2. “The story in the last year has been about people responding [to generative AI],” says Maslej, “whether it’s in policy, whether it’s in public opinion, or whether it’s in industry with a lot more investment.” Another chart in the report shows that most of that private investment in generative AI is happening in the United States.
2. Google is dominating the foundation model race
Foundation models are big multipurpose models—for example, OpenAI’s GPT-3 and GPT-4 are the foundation models that enable ChatGPT users to write code or Shakespearean sonnets. Since training these models typically requires vast resources, industry now makes most of them, with academia only putting out a few. Companies release foundation models both to push the state-of-the-art forward and to give developers a foundation on which to build products and services. Google released the most in 2023.
3. Closed models outperform open ones
One of the hot debates in AI right now is whether foundation models should be open or closed, with some arguing passionately that open models are dangerous and others maintaining that open models drive innovation. The AI Index doesn’t wade into that debate, but instead looks at trends such as how many open and closed models have been released (another chart, not included here, shows that of the 149 foundation models released in 2023, 98 were open, 23 gave partial access through an API, and 28 were closed).
The chart above reveals another aspect: Closed models outperform open ones on a host of commonly used benchmarks. Maslej says the debate about open versus closed “usually centers around risk concerns, but there’s less discussion about whether there are meaningful performance trade-offs.”
4. Foundation models have gotten super expensive
Here’s why industry is dominating the foundation model scene: Training a big one takes very deep pockets. But exactly how deep? AI companies rarely reveal the expenses involved in training their models, but the AI Index went beyond the typical speculation by collaborating with the AI research organization Epoch AI. To come up with their cost estimates, the report explains, the Epoch team “analyzed training duration, as well as the type, quantity, and utilization rate of the training hardware” using information gleaned from publications, press releases, and technical reports.
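The report doesn’t publish a single formula, but the basic arithmetic behind such an estimate can be sketched in a few lines. The helper function and every number below are hypothetical placeholders, meant only to show how duration, hardware count, and hourly price combine.

```python
# Back-of-the-envelope training-cost estimate in the spirit of the
# approach described above: duration x hardware count x hourly price.
# All inputs are hypothetical.
def estimate_training_cost(
    training_days: float,
    num_accelerators: int,
    price_per_accelerator_hour_usd: float,
) -> float:
    """Rough cloud-rental cost of a training run, in US dollars."""
    hours = training_days * 24
    return hours * num_accelerators * price_per_accelerator_hour_usd

# Example: a 90-day run on 10,000 accelerators rented at $2/hour
# comes to roughly $43 million.
print(f"${estimate_training_cost(90, 10_000, 2.0):,.0f}")
```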
It’s interesting to note that Google’s 2017 transformer model, which introduced the architecture that underpins almost all of today’s large language models, was trained for only US $930.
5. And they have a hefty carbon footprint
The AI Index team also estimated the carbon footprint of certain large language models. The report notes that the variance between models is due to factors including model size, data center energy efficiency, and the carbon intensity of energy grids. Another chart in the report (not included here) shows a first guess at emissions related to inference—when a model is doing the work it was trained for—and calls for more disclosures on this topic. As the report notes: “While the per-query emissions of inference may be relatively low, the total impact can surpass that of training when models are queried thousands, if not millions, of times daily.”
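A common first-order approach, and roughly the kind of accounting the report describes, multiplies the energy drawn by the training hardware by a data-center overhead factor and the carbon intensity of the local grid. The sketch below uses made-up numbers; none of them are figures from the AI Index.

```python
# First-order carbon estimate for a training run: accelerator energy,
# scaled by data-center overhead (PUE) and grid carbon intensity.
# All numbers are hypothetical placeholders, not figures from the report.
def training_emissions_tco2e(
    accelerator_hours: float,
    avg_power_kw: float,          # average draw per accelerator, in kW
    pue: float,                   # data-center power usage effectiveness
    grid_kgco2e_per_kwh: float,   # carbon intensity of the local grid
) -> float:
    energy_kwh = accelerator_hours * avg_power_kw * pue
    return energy_kwh * grid_kgco2e_per_kwh / 1000.0  # kg -> metric tons

# Example: 21.6 million accelerator-hours at 0.4 kW, PUE 1.1, on a grid
# emitting 0.4 kgCO2e/kWh -> roughly 3,800 tCO2e.
print(training_emissions_tco2e(21_600_000, 0.4, 1.1, 0.4))
```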
6. The United States leads in foundation models
While Maslej says the report isn’t trying to “declare a winner to this race,” he does note that the United States is leading in several categories, including number of foundation models released (above) and number of AI systems deemed significant technical advances. However, he notes that China leads in other categories including AI patents granted and installation of industrial robots.
7. Industry calls new PhDs
This one is hardly a surprise, given the previously discussed data about industry getting lots of investment for generative AI and releasing lots of exciting models. In 2022 (the most recent year for which the Index has data), 70 percent of new AI PhDs in North America took jobs in industry. It’s a continuation of a trend that’s been playing out over the last few years.
8. Some progress on diversity
For years, there’s been little progress on making AI less white and less male. But this year’s report offers a few hopeful signs. For example, the number of non-white and female students taking the AP computer science exam is on the rise. The graph above shows the trends for ethnicity, while another graph, not included here, shows that 30 percent of the students taking the exam are now girls.
Another graph in the report shows that at the undergraduate level, there’s also a positive trend in increasing ethnic diversity among North American students earning bachelor’s degrees in computer science, although the number of women earning CS bachelor’s degrees has barely budged over the last five years. Says Maslej, “it’s important to know that there’s still a lot of work to be done here.”
9. Chatter in earnings calls
Businesses are awake to the possibilities of AI. The Index got data about Fortune 500 companies’ earnings calls from Quid, a market intelligence firm that used natural language processing tools to scan for all mentions of “artificial intelligence,” “AI,” “machine learning,” “ML,” and “deep learning.” Nearly 80 percent of the companies included discussion of AI in their calls. “I think there’s a fear in business leaders that if they don’t use this technology, they’re going to miss out,” Maslej says.
And while some of that chatter is likely just CEOs bandying about buzzwords, another graph in the report shows that 55 percent of companies included in a McKinsey survey have implemented AI in at least one business unit.
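Quid’s exact pipeline isn’t public, but the core of the mention-counting described above is straightforward keyword matching over transcripts, something like the toy sketch below. The sample transcripts and the matching rules are invented for illustration.

```python
# Minimal sketch of an AI-mention scan over earnings-call transcripts.
# The transcript text and the exact matching rules used by Quid are
# assumptions here; this is only meant to show the general approach.
import re

AI_TERMS = ["artificial intelligence", "machine learning", "deep learning",
            r"\bAI\b", r"\bML\b"]
pattern = re.compile("|".join(AI_TERMS), flags=re.IGNORECASE)

def mentions_ai(transcript: str) -> bool:
    return bool(pattern.search(transcript))

transcripts = {
    "ACME Corp Q4": "Our new AI copilots reduced support costs...",
    "Widget Inc Q4": "Revenue grew 4 percent on strong demand...",
}
share = sum(mentions_ai(t) for t in transcripts.values()) / len(transcripts)
print(f"{share:.0%} of calls mention AI")  # -> 50% for this toy sample
```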
10. Costs go down, revenues go up
And here’s why AI isn’t just a corporate buzzword: The same McKinsey survey showed that the integration of AI has caused companies’ costs to go down and their revenues to go up. Overall, 42 percent of respondents said they’d seen reduced costs, and 59 percent claimed increased revenue.
Other charts in the report suggest that this impact on the bottom line reflects efficiency gains and better worker productivity. In 2023, a number of studies in different fields showed that AI enabled workers to complete tasks more quickly and produce better quality work. One study looked at coders using Copilot, while others looked at consultants, call center agents, and law students. “These studies also show that although every worker benefits, AI helps lower-skilled workers more than it does high-skilled workers,” says Maslej.
11. Corporations do perceive risks
This year, the AI Index team ran a global survey of 1,000 corporations with revenues of at least $500 million to understand how businesses are thinking about responsible AI. The results showed that privacy and data governance is perceived as the greatest risk across the globe, while fairness (often discussed in terms of algorithmic bias) still hasn’t registered with most companies. Another chart in the report shows that companies are taking action on their perceived risks: The majority of organizations across regions have implemented at least one responsible AI measure in response to relevant risks.
12. AI can’t beat humans at everything... yet
In recent years, AI systems have outperformed humans on a range of tasks, including reading comprehension and visual reasoning, and Maslej notes that the pace of AI performance improvement has also picked up. “A decade ago, with a benchmark like ImageNet, you could rely on that to challenge AI researchers for five or six years,” he says. “Now, a new benchmark is introduced for competition-level mathematics and the AI starts at 30 percent, and then in a year it gets to 90 percent.” While there are still complex cognitive tasks where humans outperform AI systems, let’s check in next year to see how that’s going.
13. Developing norms of AI responsibility
When an AI company is preparing to release a big model, it’s standard practice to test it against popular benchmarks in the field, thus giving the AI community a sense of how models stack up against each other in terms of technical performance. However, it has been less common to test models against responsible AI benchmarks that assess such things as toxic language output (RealToxicityPrompts and ToxiGen), harmful bias in responses (BOLD and BBQ), and a model’s degree of truthfulness (TruthfulQA). That’s starting to change, as there’s a growing sense that checking one’s model against these benchmarks is, well, the responsible thing to do. However, another chart in the report shows that consistency is lacking: Developers are testing their models against different benchmarks, making comparisons harder.
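For a sense of what running one of these checks involves, here is a minimal sketch of pulling a truthfulness benchmark and inspecting its prompts. The Hugging Face dataset ID, config, and field names are assumptions based on TruthfulQA’s public release; verify them against the benchmark’s documentation before relying on them.

```python
# Sketch of loading one of the responsible-AI benchmarks mentioned above
# and inspecting its prompts. Dataset ID, config, and field names are
# assumptions; check the benchmark's official release before relying on them.
from datasets import load_dataset

truthfulqa = load_dataset("truthful_qa", "generation", split="validation")
print(truthfulqa[0]["question"])      # one benchmark question
print(truthfulqa[0]["best_answer"])   # the reference truthful answer

# A full evaluation loop would feed each question to the model under test
# and compare its output against the reference answers, aggregating a
# truthfulness score across the whole split.
```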
14. Laws both boost and constrain AI
Between 2016 and 2023, the AI Index found that 33 countries had passed at least one law related to AI, with most of the action occurring in the United States and Europe; in total, 148 AI-related bills have been passed in that timeframe. The Index researchers also classified bills as either expansive laws that aim to enhance a country’s AI capabilities or restrictive laws that place limits on AI applications and usage. While many bills continue to boost AI, the researchers found a global trend toward restrictive legislation.
15. AI makes people nervous
The Index’s public opinion data comes from a global survey on attitudes toward AI, with responses from 22,816 adults (ages 16 to 74) in 31 countries. More than half of respondents said that AI makes them nervous, up from 39 percent the year before. And two-thirds of people now expect AI to profoundly change their daily lives in the next few years.
Maslej notes that other charts in the index show significant differences in opinion among different demographics, with young people being more inclined toward an optimistic view of how AI will change their lives. Interestingly, “a lot of this kind of AI pessimism comes from Western, well-developed nations,” he says, while respondents in places like Indonesia and Thailand said they expect AI’s benefits to outweigh its harms.
Generative AI is today’s buzziest form of artificial intelligence, and it’s what powers chatbots like ChatGPT, Ernie, LLaMA, Claude, and Command—as well as image generators like DALL-E 2, Stable Diffusion, Adobe Firefly, and Midjourney. Generative AI is the branch of AI that enables machines to learn patterns from vast datasets and then to autonomously produce new content based on those patterns. Although generative AI is fairly new, there are already many examples of models that can produce text, images, videos, and audio.
Many “foundation models” have been trained on enough data to be competent in a wide variety of tasks. For example, a large language model can generate essays, computer code, recipes, protein structures, jokes, medical diagnostic advice, and much more. It can also theoretically generate instructions for building a bomb or creating a bioweapon, though safeguards are supposed to prevent such types of misuse.
What’s the difference between AI, machine learning, and generative AI?
Artificial intelligence (AI) refers to a wide variety of computational approaches to mimicking human intelligence.
Machine learning (ML) is a subset of AI; it focuses on algorithms that enable systems to learn from data and improve their performance. Before generative AI came along, most ML models learned from datasets to perform tasks such as classification or prediction. Generative AI is a specialized type of ML involving models that perform the task of generating new content, venturing into the realm of creativity.
What architectures do generative AI models use?
Generative models are built using a variety of neural network architectures—essentially the design and structure that defines how the model is organized and how information flows through it. Some of the most well-known architectures are variational autoencoders (VAEs), generative adversarial networks (GANs), and transformers. It’s the transformer architecture, first shown in a seminal 2017 paper from Google, that powers today’s large language models. However, the transformer architecture is less suited for other types of generative AI, such as image and audio generation.
Autoencoders learn efficient representations of data through an encoder-decoder framework. The encoder compresses input data into a lower-dimensional space, known as the latent (or embedding) space, that preserves the most essential aspects of the data. A decoder can then use this compressed representation to reconstruct the original data. Once an autoencoder has been trained in this way, it can use novel inputs to generate what it considers the appropriate outputs. These models are often deployed in image-generation tools and have also found use in drug discovery, where they can be used to generate new molecules with desired properties.
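To make the encoder-decoder idea concrete, here is a toy autoencoder sketch in PyTorch. The layer sizes and data are arbitrary placeholders; a real image or molecule generator would be far larger and more specialized.

```python
# Toy autoencoder: an encoder squeezes the input into a small latent
# vector, and a decoder reconstructs the input from that vector.
# Layer sizes are arbitrary illustrations, not a production design.
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),   # the compressed "latent" code
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),    # reconstruct the original size
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(8, 784)                       # a batch of flattened 28x28 "images"
loss = nn.functional.mse_loss(model(x), x)   # reconstruction error to minimize
loss.backward()
```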
With generative adversarial networks (GANs), the training involves a generator and a discriminator that can be considered adversaries. The generator strives to create realistic data, while the discriminator aims to distinguish between those generated outputs and real “ground truth” outputs. Every time the discriminator catches a generated output, the generator uses that feedback to try to improve the quality of its outputs. But the discriminator also receives feedback on its performance. This adversarial interplay results in the refinement of both components, leading to the generation of increasingly authentic-seeming content. GANs are best known for creating deepfakes but can also be used for more benign forms of image generation and many other applications.
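Here is what that adversarial interplay looks like in skeletal form: one training step in PyTorch, with toy networks and random stand-in data rather than a real image dataset.

```python
# Skeleton of one GAN training step: the discriminator learns to separate
# real samples from generated ones, and the generator learns to fool it.
# Network sizes and data are toy placeholders.
import torch
from torch import nn

latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, data_dim)        # stand-in for a batch of real data
fake = G(torch.randn(32, latent_dim))  # generator output from random noise

# Discriminator step: label real samples 1 and generated samples 0.
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: try to make the discriminator call the fakes "real".
g_loss = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad()
g_loss.backward()   # gradients flow back through D into G; only G is updated
opt_g.step()
```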
The transformer is arguably the reigning champion of generative AI architectures for its ubiquity in today’s powerful large language models (LLMs). Its strength lies in its attention mechanism, which enables the model to focus on different parts of an input sequence while making predictions. In the case of language models, the input consists of strings of words that make up sentences, and the transformer predicts what words will come next (we’ll get into the details below). In addition, transformers can process all the elements of a sequence in parallel rather than marching through it from beginning to end, as earlier types of models did; this parallelization makes training faster and more efficient. When developers added vast datasets of text for transformer models to learn from, today’s remarkable chatbots emerged.
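The attention computation at the core of the architecture fits in a few lines. The sketch below shows plain scaled dot-product self-attention; real transformers add multiple heads, learned projection layers, and positional information.

```python
# The heart of the transformer: scaled dot-product attention. Every position
# in the sequence attends to every other position in one batched matrix
# operation, which is what makes transformers so parallelizable.
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # q, k, v: (batch, seq_len, d) tensors of queries, keys, and values
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)  # how much each token attends to each other token
    return weights @ v

x = torch.rand(1, 10, 64)   # a 10-token sequence of 64-dimensional embeddings
out = attention(x, x, x)    # self-attention: queries, keys, and values all come from x
print(out.shape)            # torch.Size([1, 10, 64])
```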
How do large language models work?
A transformer-based LLM is trained by giving it a vast dataset of text to learn from. The attention mechanism comes into play as it processes sentences and looks for patterns. By looking at all the words in a sentence at once, it gradually begins to understand which words are most commonly found together and which words are most important to the meaning of the sentence. It learns these things by trying to predict the next word in a sentence and comparing its guess to the ground truth. Its errors act as feedback signals that cause the model to adjust the weights it assigns to various words before it tries again.
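In code, that guess-compare-adjust loop boils down to a cross-entropy loss over token sequences shifted by one position. The sketch below uses a tiny stand-in network rather than a real transformer, but the training signal is the same.

```python
# Schematic next-token training step: the model predicts token t+1 from
# tokens up to t, and the cross-entropy between its guess and the ground
# truth provides the feedback that adjusts the weights. The tiny model
# here is a stand-in, not a real LLM.
import torch
from torch import nn

vocab_size, seq_len, batch = 1000, 32, 8
model = nn.Sequential(nn.Embedding(vocab_size, 128),
                      nn.Flatten(0, 1),           # treat every position as a prediction
                      nn.Linear(128, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

tokens = torch.randint(0, vocab_size, (batch, seq_len + 1))  # toy training text
inputs, targets = tokens[:, :-1], tokens[:, 1:]              # shift by one position

logits = model(inputs)                                       # (batch*seq_len, vocab_size)
loss = nn.functional.cross_entropy(logits, targets.reshape(-1))
loss.backward()                                              # errors become weight updates
optimizer.step()
```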
These five LLMs vary greatly in size (given in parameters), and the larger models have better performance on a standard LLM benchmark test. IEEE Spectrum
To explain the training process in slightly more technical terms, the text in the training data is broken down into elements called tokens, which are words or pieces of words—but for simplicity’s sake, let’s say all tokens are words. As the model goes through the sentences in its training data and learns the relationships between tokens, it creates a list of numbers, called a vector, for each one. All the numbers in the vector represent various aspects of the word: its semantic meanings, its relationship to other words, its frequency of use, and so on. Similar words, like elegant and fancy, will have similar vectors and will also be near each other in the vector space. These vectors are called word embeddings. The parameters of an LLM include the weights associated with all the word embeddings and the attention mechanism. GPT-4, the OpenAI model that’s considered the current champion, is rumored to have more than 1 trillion parameters.
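A toy example makes the vector idea tangible: similar words get similar vectors, which we can measure with cosine similarity. The three-dimensional vectors below are invented for illustration; real embeddings have hundreds or thousands of learned dimensions.

```python
# Word embeddings as vectors: similar words end up with similar vectors,
# which we can check with cosine similarity. The three-dimensional vectors
# below are made-up toy values, not learned embeddings.
import torch
import torch.nn.functional as F

embeddings = {
    "elegant": torch.tensor([0.81, 0.10, 0.52]),
    "fancy":   torch.tensor([0.78, 0.15, 0.49]),
    "tractor": torch.tensor([0.05, 0.92, 0.11]),
}

def similarity(a, b):
    return F.cosine_similarity(embeddings[a], embeddings[b], dim=0).item()

print(similarity("elegant", "fancy"))    # close to 1.0: near neighbors in the vector space
print(similarity("elegant", "tractor"))  # much lower: unrelated words
```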
Given enough data and training time, the LLM begins to understand the subtleties of language. While much of the training involves looking at text sentence by sentence, the attention mechanism also captures relationships between words throughout a longer text sequence of many paragraphs. Once an LLM is trained and is ready for use, the attention mechanism is still in play. When the model is generating text in response to a prompt, it’s using its predictive powers to decide what the next word should be. When generating longer pieces of text, it predicts the next word in the context of all the words it has written so far; this function increases the coherence and continuity of its writing.
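Generation itself is a simple loop around that predictive step: predict, append, repeat. In the sketch below, a made-up placeholder stands in for the trained model’s forward pass, but the autoregressive structure is the same one real LLMs use.

```python
# Autoregressive generation in miniature: the model repeatedly predicts a
# next token given everything generated so far, appends it, and repeats.
# `toy_next_token_logits` is a made-up stand-in for a trained LLM's forward pass.
import torch

vocab_size = 1000

def toy_next_token_logits(context: torch.Tensor) -> torch.Tensor:
    # A real LLM would run its transformer layers over the whole context here;
    # this placeholder just returns random scores over the vocabulary.
    return torch.randn(vocab_size)

context = torch.tensor([42, 7, 311])         # token IDs for the prompt
for _ in range(20):                          # generate 20 more tokens
    logits = toy_next_token_logits(context)
    probs = torch.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, 1)  # sample rather than always taking the top token
    context = torch.cat([context, next_token])

print(context.tolist())                      # prompt plus generated token IDs
```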
Why do large language models hallucinate?
You may have heard that LLMs sometimes “hallucinate.” That’s a polite way to say they make stuff up very convincingly. A model sometimes generates text that fits the context and is grammatically correct, yet the material is erroneous or nonsensical. This bad habit stems from LLMs training on vast troves of data drawn from the Internet, plenty of which is not factually accurate. Since the model is simply trying to predict the next word in a sequence based on what it has seen, it may generate plausible-sounding text that has no grounding in reality.
Why is generative AI controversial?
One source of controversy for generative AI is the provenance of its training data. Most AI companies that train large models to generate text, images, video, and audio have not been transparent about the content of their training datasets. Various leaks and experiments have revealed that those datasets include copyrighted material such as books, newspaper articles, and movies. A number of lawsuits are underway to determine whether use of copyrighted material for training AI systems constitutes fair use, or whether the AI companies need to pay the copyright holders for use of their material.
On a related note, many people are concerned that the widespread use of generative AI will take jobs away from creative humans who make art, music, written works, and so forth. People are also concerned that it could take jobs from humans who do a wide range of white-collar jobs, including translators, paralegals, customer-service representatives, and journalists. There have already been a few troubling layoffs, but it’s hard to say yet whether generative AI will be reliable enough for large-scale enterprise applications. (See above about hallucinations.)
Finally, there’s the danger that generative AI will be used to make bad stuff. And there are of course many categories of bad stuff it could theoretically be used for. Generative AI can be used for personalized scams and phishing attacks: For example, using “voice cloning,” scammers can copy the voice of a specific person and call the person’s family with a plea for help (and money). All formats of generative AI—text, audio, image, and video—can be used to generate misinformation by creating plausible-seeming representations of things that never happened, which is a particularly worrying possibility when it comes to elections. (Meanwhile, as IEEE Spectrum reported this week, the U.S. Federal Communications Commission has responded by outlawing AI-generated robocalls.) Image- and video-generating tools can be used to produce nonconsensual pornography, although the tools made by mainstream companies disallow such use. And chatbots can theoretically walk a would-be terrorist through the steps of making a bomb, nerve gas, and a host of other horrors. Although the big LLMs have safeguards to prevent such misuse, some hackers delight in circumventing those safeguards. What’s more, “uncensored” versions of open-source LLMs are out there.
Despite such potential problems, many people think that generative AI can also make people more productive and could be used as a tool to enable entirely new forms of creativity. We’ll likely see both disasters and creative flowerings and plenty else that we don’t expect. But knowing the basics of how these models work is increasingly crucial for tech-savvy people today. Because no matter how sophisticated these systems grow, it’s the humans’ job to keep them running, make the next ones better, and with any luck, help people out too.
Andrew Ng has serious street cred in artificial intelligence. He pioneered the use of graphics processing units (GPUs) to train deep learning models in the late 2000s with his students at Stanford University, cofounded Google Brain in 2011, and then served for three years as chief scientist for Baidu, where he helped build the Chinese tech giant’s AI group. So when he says he has identified the next big shift in artificial intelligence, people listen. And that’s what he told IEEE Spectrum in an exclusive Q&A.
Ng’s current efforts are focused on his company Landing AI, which built a platform called LandingLens to help manufacturers improve visual inspection with computer vision. He has also become something of an evangelist for what he calls the data-centric AI movement, which he says can yield “small data” solutions to big issues in AI, including model efficiency, accuracy, and bias.
The great advances in deep learning over the past decade or so have been powered by ever-bigger models crunching ever-bigger amounts of data. Some people argue that that’s an unsustainable trajectory. Do you agree that it can’t go on that way?
Andrew Ng: This is a big question. We’ve seen foundation models in NLP [natural language processing]. I’m excited about NLP models getting even bigger, and also about the potential of building foundation models in computer vision. I think there’s lots of signal to still be exploited in video: We have not been able to build foundation models yet for video because of compute bandwidth and the cost of processing video, as opposed to tokenized text. So I think that this engine of scaling up deep learning algorithms, which has been running for something like 15 years now, still has steam in it. Having said that, it only applies to certain problems, and there’s a set of other problems that need small data solutions.
When you say you want a foundation model for computer vision, what do you mean by that?
Ng: This is a term coined by Percy Liang and some of my friends at Stanford to refer to very large models, trained on very large data sets, that can be tuned for specific applications. For example, GPT-3 is an example of a foundation model [for NLP]. Foundation models offer a lot of promise as a new paradigm in developing machine learning applications, but also challenges in terms of making sure that they’re reasonably fair and free from bias, especially if many of us will be building on top of them.
What needs to happen for someone to build a foundation model for video?
Ng: I think there is a scalability problem. The compute power needed to process the large volume of images for video is significant, and I think that’s why foundation models have arisen first in NLP. Many researchers are working on this, and I think we’re seeing early signs of such models being developed in computer vision. But I’m confident that if a semiconductor maker gave us 10 times more processor power, we could easily find 10 times more video to build such models for vision.
Having said that, a lot of what’s happened over the past decade is that deep learning has happened in consumer-facing companies that have large user bases, sometimes billions of users, and therefore very large data sets. While that paradigm of machine learning has driven a lot of economic value in consumer software, I find that that recipe of scale doesn’t work for other industries.
It’s funny to hear you say that, because your early work was at a consumer-facing company with millions of users.
Ng: Over a decade ago, when I proposed starting the Google Brain project to use Google’s compute infrastructure to build very large neural networks, it was a controversial step. One very senior person pulled me aside and warned me that starting Google Brain would be bad for my career. I think he felt that the action couldn’t just be in scaling up, and that I should instead focus on architecture innovation.
“In many industries where giant data sets simply don’t exist, I think the focus has to shift from big data to good data. Having 50 thoughtfully engineered examples can be sufficient to explain to the neural network what you want it to learn.”
—Andrew Ng, CEO & Founder, Landing AI
I remember when my students and I published the first NeurIPS workshop paper advocating using CUDA, a platform for processing on GPUs, for deep learning—a different senior person in AI sat me down and said, “CUDA is really complicated to program. As a programming paradigm, this seems like too much work.” I did manage to convince him; the other person I did not convince.
I expect they’re both convinced now.
Ng: I think so, yes.
Over the past year as I’ve been speaking to people about the data-centric AI movement, I’ve been getting flashbacks to when I was speaking to people about deep learning and scalability 10 or 15 years ago. In the past year, I’ve been getting the same mix of “there’s nothing new here” and “this seems like the wrong direction.”
How do you define data-centric AI, and why do you consider it a movement?
Ng: Data-centric AI is the discipline of systematically engineering the data needed to successfully build an AI system. For an AI system, you have to implement some algorithm, say a neural network, in code and then train it on your data set. The dominant paradigm over the last decade was to download the data set while you focus on improving the code. Thanks to that paradigm, over the last decade deep learning networks have improved significantly, to the point where for a lot of applications the code—the neural network architecture—is basically a solved problem. So for many practical applications, it’s now more productive to hold the neural network architecture fixed, and instead find ways to improve the data.
When I started speaking about this, there were many practitioners who, completely appropriately, raised their hands and said, “Yes, we’ve been doing this for 20 years.” This is the time to take the things that some individuals have been doing intuitively and make it a systematic engineering discipline.
The data-centric AI movement is much bigger than one company or group of researchers. My collaborators and I organized a data-centric AI workshop at NeurIPS, and I was really delighted at the number of authors and presenters that showed up.
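To make the shift Ng describes concrete, here is a minimal sketch of a data-centric iteration loop. The helper functions are hypothetical placeholders, not Landing AI code; the point is that the architecture and training code stay fixed while each round changes only the data.

```python
# Data-centric iteration: hold the model code fixed, iterate on the data.
# All helper functions passed in are hypothetical placeholders for illustration.

def data_centric_loop(dataset, build_model, train, evaluate, improve_data,
                      target_score=0.95, max_rounds=10):
    """Repeatedly retrain a fixed architecture while engineering the data."""
    model = build_model()                  # architecture chosen once, then frozen
    for round_num in range(max_rounds):
        train(model, dataset)              # same training code every round
        score, error_report = evaluate(model, dataset)
        print(f"round {round_num}: score={score:.3f}")
        if score >= target_score:
            break
        # The only thing that changes between rounds is the data:
        # fix inconsistent labels, drop noise, collect targeted examples.
        dataset = improve_data(dataset, error_report)
    return model, dataset
```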
You often talk about companies or institutions that have only a small amount of data to work with. How can data-centric AI help them?
Ng: You hear a lot about vision systems built with millions of images—I once built a face recognition system using 350 million images. Architectures built for hundreds of millions of images don’t work with only 50 images. But it turns out, if you have 50 really good examples, you can build something valuable, like a defect-inspection system. In many industries where giant data sets simply don’t exist, I think the focus has to shift from big data to good data. Having 50 thoughtfully engineered examples can be sufficient to explain to the neural network what you want it to learn.
When you talk about training a model with just 50 images, does that really mean you’re taking an existing model that was trained on a very large data set and fine-tuning it? Or do you mean a brand new model that’s designed to learn only from that small data set?
Ng: Let me describe what Landing AI does. When doing visual inspection for manufacturers, we often use our own flavor of RetinaNet. It is a pretrained model. Having said that, the pretraining is a small piece of the puzzle. What’s a bigger piece of the puzzle is providing tools that enable the manufacturer to pick the right set of images [to use for fine-tuning] and label them in a consistent way. There’s a very practical problem we’ve seen spanning vision, NLP, and speech, where even human annotators don’t agree on the appropriate label. For big data applications, the common response has been: If the data is noisy, let’s just get a lot of data and the algorithm will average over it. But if you can develop tools that flag where the data’s inconsistent and give you a very targeted way to improve the consistency of the data, that turns out to be a more efficient way to get a high-performing system.
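As a rough illustration of the pretraining piece, here is a minimal transfer-learning sketch in PyTorch. It uses a generic ResNet classifier rather than Landing AI’s RetinaNet variant, and the two-class defect setup, hyperparameters, and data loader are assumptions for illustration.

```python
import torch
from torch import nn
from torchvision import models

# Assumed setup for illustration: start from a pretrained backbone, freeze it,
# and train only a small head on a few dozen carefully labeled defect images.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in backbone.parameters():
    param.requires_grad = False                       # keep pretrained features fixed
backbone.fc = nn.Linear(backbone.fc.in_features, 2)   # defect vs. no-defect head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def finetune(small_loader, epochs=20):
    """small_loader yields batches from the ~50 labeled images (not shown here)."""
    for _ in range(epochs):
        for images, labels in small_loader:
            optimizer.zero_grad()
            loss = loss_fn(backbone(images), labels)
            loss.backward()
            optimizer.step()
```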
“Collecting more data often helps, but if you try to collect more data for everything, that can be a very expensive activity.”
—Andrew Ng
For example, if you have 10,000 images where 30 images are of one class, and those 30 images are labeled inconsistently, one of the things we do is build tools to draw your attention to the subset of data that’s inconsistent. So you can very quickly relabel those images to be more consistent, and this leads to improvement in performance.
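A rough sketch of the kind of consistency tool Ng describes: flag every image whose annotators disagreed so it can be reviewed and relabeled first. The annotation records below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical annotation records: (image_id, annotator, label).
annotations = [
    ("img_001", "annotator_a", "scratch"),
    ("img_001", "annotator_b", "scratch"),
    ("img_002", "annotator_a", "pit_mark"),
    ("img_002", "annotator_b", "discoloration"),   # disagreement: review this first
]

def flag_inconsistent(annotations):
    """Return image IDs whose annotators did not agree on a single label."""
    labels_by_image = defaultdict(set)
    for image_id, _, label in annotations:
        labels_by_image[image_id].add(label)
    return sorted(img for img, labels in labels_by_image.items() if len(labels) > 1)

print(flag_inconsistent(annotations))   # ['img_002']
```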
Could this focus on high-quality data help with bias in data sets? If you’re able to curate the data more before training?
Ng: Very much so. Many researchers have pointed out that biased data is one factor among many leading to biased systems. There have been many thoughtful efforts to engineer the data. At the NeurIPS workshop, Olga Russakovsky gave a really nice talk on this. At the main NeurIPS conference, I also really enjoyed Mary Gray’s presentation, which touched on how data-centric AI is one piece of the solution, but not the entire solution. New tools like Datasheets for Datasets also seem like an important piece of the puzzle.
One of the powerful tools that data-centric AI gives us is the ability to engineer a subset of the data. Imagine training a machine-learning system and finding that its performance is okay for most of the data set, but its performance is biased for just a subset of the data. If you try to change the whole neural network architecture to improve the performance on just that subset, it’s quite difficult. But if you can engineer a subset of the data you can address the problem in a much more targeted way.
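One way to read “engineer a subset of the data” in code: score the model slice by slice, find where it underperforms, and direct relabeling or data collection there. The group names and results below are hypothetical.

```python
from collections import defaultdict

def accuracy_by_slice(examples):
    """examples: iterable of (slice_name, correct) pairs, where `correct` is True
    when the model's prediction matched the label. Returns per-slice accuracy so
    the weakest slice can get targeted relabeling or data collection."""
    totals, hits = defaultdict(int), defaultdict(int)
    for slice_name, correct in examples:
        totals[slice_name] += 1
        hits[slice_name] += int(correct)
    return {s: hits[s] / totals[s] for s in totals}

# Hypothetical evaluation results, grouped by a rare or sensitive attribute.
results = [("group_a", True)] * 90 + [("group_a", False)] * 10 \
        + [("group_b", True)] * 6 + [("group_b", False)] * 4
scores = accuracy_by_slice(results)
worst = min(scores, key=scores.get)
print(scores, "-> engineer more data for:", worst)
```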
When you talk about engineering the data, what do you mean exactly?
Ng: In AI, data cleaning is important, but the way the data has been cleaned has often been in very manual ways. In computer vision, someone may visualize images through a Jupyter notebook and maybe spot the problem, and maybe fix it. But I’m excited about tools that allow you to have a very large data set, tools that draw your attention quickly and efficiently to the subset of data where, say, the labels are noisy. Or to quickly bring your attention to the one class among 100 classes where it would benefit you to collect more data. Collecting more data often helps, but if you try to collect more data for everything, that can be a very expensive activity.
For example, I once figured out that a speech-recognition system was performing poorly when there was car noise in the background. Knowing that allowed me to collect more data with car noise in the background, rather than trying to collect more data for everything, which would have been expensive and slow.
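A minimal sketch of that kind of targeted error analysis, assuming each misrecognized clip carries a tag for its background condition; the tags and counts are invented for illustration.

```python
from collections import Counter

# Hypothetical error log from a speech-recognition evaluation run:
# each entry is the background condition of a misrecognized clip.
error_conditions = ["car_noise", "car_noise", "quiet", "car_noise",
                    "cafe", "car_noise", "quiet"]

def collection_priorities(error_conditions, top_k=2):
    """Rank background conditions by how many errors they account for,
    so data collection targets the worst conditions instead of everything."""
    return Counter(error_conditions).most_common(top_k)

print(collection_priorities(error_conditions))
# [('car_noise', 4), ('quiet', 2)] -> record more audio with car noise first
```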
What about using synthetic data, is that often a good solution?
Ng: I think synthetic data is an important tool in the tool chest of data-centric AI. At the NeurIPS workshop, Anima Anandkumar gave a great talk that touched on synthetic data. I think there are important uses of synthetic data that go beyond just being a preprocessing step for increasing the data set for a learning algorithm. I’d love to see more tools to let developers use synthetic data generation as part of the closed loop of iterative machine learning development.
Do you mean that synthetic data would allow you to try the model on more data sets?
Ng: Not really. Here’s an example. Let’s say you’re trying to detect defects in a smartphone casing. There are many different types of defects on smartphones. It could be a scratch, a dent, pit marks, discoloration of the material, other types of blemishes. If you train the model and then find through error analysis that it’s doing well overall but it’s performing poorly on pit marks, then synthetic data generation allows you to address the problem in a more targeted way. You could generate more data just for the pit-mark category.
“In the consumer software Internet, we could train a handful of machine-learning models to serve a billion users. In manufacturing, you might have 10,000 manufacturers building 10,000 custom AI models.”
—Andrew Ng
Synthetic data generation is a very powerful tool, but there are many simpler tools that I will often try first. Such as data augmentation, improving labeling consistency, or just asking a factory to collect more data.
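As a rough sketch of what targeted generation could look like, the snippet below expands only the class that error analysis flagged as weak. The copy-and-augment routine stands in for whatever augmentation or generative step a real pipeline would use, and the class names are assumptions.

```python
def augment_image(image):
    """Stand-in for any augmentation or synthesis step (flip, crop, or a
    generative model). Here it just tags the record for illustration."""
    return {"pixels": image["pixels"], "synthetic": True}

def expand_weak_class(dataset, weak_class, copies_per_image=5):
    """Generate extra examples only for the class the model struggles with."""
    extra = []
    for record in dataset:
        if record["label"] == weak_class:
            for _ in range(copies_per_image):
                extra.append({"image": augment_image(record["image"]),
                              "label": weak_class})
    return dataset + extra

# Hypothetical defect dataset: error analysis said "pit_mark" is the weak class.
dataset = [{"image": {"pixels": [0]}, "label": "scratch"},
           {"image": {"pixels": [1]}, "label": "pit_mark"}]
print(len(expand_weak_class(dataset, "pit_mark")))   # 7: 2 originals + 5 synthetic
```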
To make these issues more concrete, can you walk me through an example? When a company approaches Landing AI and says it has a problem with visual inspection, how do you onboard them and work toward deployment?
Ng: When a customer approaches us we usually have a conversation about their inspection problem and look at a few images to verify that the problem is feasible with computer vision. Assuming it is, we ask them to upload the data to the LandingLens platform. We often advise them on the methodology of data-centric AI and help them label the data.
One of the foci of Landing AI is to empower manufacturing companies to do the machine learning work themselves. A lot of our work is making sure the software is fast and easy to use. Through the iterative process of machine learning development, we advise customers on things like how to train models on the platform, when and how to improve the labeling of data so the performance of the model improves. Our training and software supports them all the way through deploying the trained model to an edge device in the factory.
How do you deal with changing needs? If products change or lighting conditions change in the factory, can the model keep up?
Ng: It varies by manufacturer. There is data drift in many contexts. But there are some manufacturers that have been running the same manufacturing line for 20 years now with few changes, so they don’t expect changes in the next five years. Those stable environments make things easier. For other manufacturers, we provide tools to flag when there’s a significant data-drift issue. I find it really important to empower manufacturing customers to correct data, retrain, and update the model. Because if something changes and it’s 3 a.m. in the United States, I want them to be able to adapt their learning algorithm right away to maintain operations.
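A minimal sketch of how such a drift flag could work, assuming a single monitored statistic such as mean image brightness; real monitors track many features, and the three-sigma threshold here is arbitrary.

```python
from statistics import mean, stdev

def drift_alert(reference_values, recent_values, threshold=3.0):
    """Flag drift when the recent mean moves more than `threshold` standard
    deviations away from the reference distribution (a simple z-score check)."""
    ref_mean, ref_std = mean(reference_values), stdev(reference_values)
    z = abs(mean(recent_values) - ref_mean) / ref_std
    return z > threshold, z

# Hypothetical mean-brightness values per image, before and after a lighting change.
reference = [120, 118, 122, 121, 119, 120, 123, 117]
recent = [101, 99, 103, 100, 102]
alert, z = drift_alert(reference, recent)
print(f"drift={alert}, z-score={z:.1f}")   # drift=True, z-score=9.5
```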
In the consumer software Internet, we could train a handful of machine-learning models to serve a billion users. In manufacturing, you might have 10,000 manufacturers building 10,000 custom AI models. The challenge is, how do you do that without Landing AI having to hire 10,000 machine learning specialists?
So you’re saying that to make it scale, you have to empower customers to do a lot of the training and other work.
Ng: Yes, exactly! This is an industry-wide problem in AI, not just in manufacturing. Look at health care. Every hospital has its own slightly different format for electronic health records. How can every hospital train its own custom AI model? Expecting every hospital’s IT personnel to invent new neural-network architectures is unrealistic. The only way out of this dilemma is to build tools that empower the customers to build their own models by giving them tools to engineer the data and express their domain knowledge. That’s what Landing AI is executing in computer vision, and the field of AI needs other teams to execute this in other domains.
Is there anything else you think it’s important for people to understand about the work you’re doing or the data-centric AI movement?
Ng: In the last decade, the biggest shift in AI was a shift to deep learning. I think it’s quite possible that in this decade the biggest shift will be to data-centric AI. With the maturity of today’s neural network architectures, I think for a lot of the practical applications the bottleneck will be whether we can efficiently get the data we need to develop systems that work well. The data-centric AI movement has tremendous energy and momentum across the whole community. I hope more researchers and developers will jump in and work on it.