Tech companies have been caught up in a race to build the biggest large language models (LLMs). In April, for example, Meta announced the 400-billion-parameter Llama 3, which contains twice the number of parameters—or variables that determine how the model responds to queries—than OpenAI’s original ChatGPT model from 2022. Although not confirmed, GPT-4 is estimated to have about 1.8 trillion parameters.In the last few months, however, some of the largest tech companies, including Apple and Micro
Tech companies have been caught up in a race to build the biggest large language models (LLMs). In April, for example, Meta announced the 400-billion-parameter Llama 3, which contains twice the number of parameters—or variables that determine how the model responds to queries—than OpenAI’s original ChatGPT model from 2022. Although not confirmed, GPT-4 is estimated to have about 1.8 trillion parameters.
In the last few months, however, some of the largest tech companies, including Apple and Microsoft, have introduced small language models (SLMs). These models are a fraction of the size of their LLM counterparts and yet, on many benchmarks, can match or even outperform them in text generation.
On 10 June, at Apple’s Worldwide Developers Conference, the company announced its “Apple Intelligence” models, which have around 3 billion parameters. And in late April, Microsoft released its Phi-3 family of SLMs, featuring models housing between 3.8 billion and 14 billion parameters.
OpenAI’s CEO Sam Altman believes we’re at the end of the era of giant models.
In a series of tests, the smallest of Microsoft’s models, Phi-3-mini, rivalled OpenAI’s GPT-3.5 (175 billion parameters), which powers the free version of ChatGPT, and outperformed Google’s Gemma (7 billion parameters). The tests evaluated how well a model understands language by prompting it with questions about mathematics, philosophy, law, and more. What’s more interesting, Microsoft’s Phi-3-small, with 7 billion parameters, fared remarkably better than GPT-3.5 in many of these benchmarks.
Aaron Mueller, who researches language models at Northeastern University in Boston, isn’t surprised SLMs can go toe-to-toe with LLMs in select functions. He says that’s because scaling the number of parameters isn’t the only way to improve a model’s performance: Training it on higher-quality data can yield similar results too.
Microsoft’s Phi models were trained on fine-tuned “textbook-quality” data, says Mueller, which have a more consistent style that’s easier to learn from than the highly diverse text from across the Internet that LLMs typically rely on. Similarly, Apple trained its SLMs exclusively on richer and more complex datasets.
The rise of SLMs comes at a time when the performance gap between LLMs is quickly narrowing and tech companies look to deviate from standard scaling laws and explore other avenues for performance upgrades. At an event in April, OpenAI’s CEO Sam Altman said he believes we’re at the end of the era of giant models. “We’ll make them better in other ways.”
Because SLMs don’t consume nearly as much energy as LLMs, they can also run locally on devices like smartphones and laptops (instead of in the cloud) to preserve data privacy and personalize them to each person. In March, Google rolled out Gemini Nano to the company’s Pixel line of smartphones. The SLM can summarize audio recordings and produce smart replies to conversations without an Internet connection. Apple is expected to follow suit later this year.
More importantly, SLMs can democratize access to language models, says Mueller. So far, AI development has been concentrated into the hands of a couple of large companies that can afford to deploy high-end infrastructure, while other, smaller operations and labs have been forced to license them for hefty fees.
Since SLMs can be easily trained on more affordable hardware, says Mueller, they’re more accessible to those with modest resources and yet still capable enough for specific applications.
In addition, while researchers agree there’s still a lot of work ahead to overcome hallucinations, carefully curated SLMs bring them a step closer toward building responsible AI that is also interpretable, which would potentially allow researchers to debug specific LLM issues and fix them at the source.
For researchers like Alex Warstadt, a computer science researcher at ETH Zurich, SLMs could also offer new, fascinating insights into a longstanding scientific question: How children acquire their first language. Warstadt, alongside a group of researchers including Northeastern’s Mueller, organizes BabyLM, a challenge in which participants optimize language-model training on small data.
Not only could SLMs potentially unlock new secrets of human cognition, but they also help improve generative AI. By the time children turn 13, they’re exposed to about 100 million words and are better than chatbots at language, with access to only 0.01 percent of the data. While no one knows what makes humans so much more efficient, says Warstadt, “reverse engineering efficient humanlike learning at small scales could lead to huge improvements when scaled up to LLM scales.”
Ever wondered what it's like to train AI models? Sounds cutting-edge and cool, maybe? Seems like something that might be interesting and where you might learn some helpful new skills, right? Well, according to some people who have recently done this work for one of the biggest AI companies in the world, the work of training AI is chaotic and inconsistent at best. — Read the rest
The post Essays explore the hellscape of freelance AI model training appeared first on Boing Boing.
Ever wondered what it's like to train AI models? Sounds cutting-edge and cool, maybe? Seems like something that might be interesting and where you might learn some helpful new skills, right? Well, according to some people who have recently done this work for one of the biggest AI companies in the world, the work of training AI is chaotic and inconsistent at best. — Read the rest
Since ChatGPT dropped in the fall of 2022, everyone and their donkey has tried their hand at prompt engineering—finding a clever way to phrase your query to a large language model (LLM) or AI art or video generator to get the best results or sidestep protections. The Internet is replete with prompt-engineering guides, cheat sheets, and advice threads to help you get the most out of an LLM.In the commercial sector, companies are now wrangling LLMs to build product copilots, automate tedious work,
In the commercial sector, companies are now wrangling LLMs to build product copilots, automate tedious work, create personal assistants, and more, says Austin Henley, a former Microsoft employee who conducted a series of interviews with people developing LLM-powered copilots. “Every business is trying to use it for virtually every use case that they can imagine,” Henley says.
“The only real trend may be no trend. What’s best for any given model, dataset, and prompting strategy is likely to be specific to the particular combination at hand.” —Rick Battle & Teja Gollapudi, VMware
To do so, they’ve enlisted the help of prompt engineers professionally.
However, new research suggests that prompt engineering is best done by the model itself, and not by a human engineer. This has cast doubt on prompt engineering’s future—and increased suspicions that a fair portion of prompt-engineering jobs may be a passing fad, at least as the field is currently imagined.
Autotuned prompts are successful and strange
Rick Battle and Teja Gollapudi at California-based cloud computing company VMware were perplexed by how finicky and unpredictable LLM performance was in response to weird prompting techniques. For example, people have found that asking models to explain its reasoning step-by-step—a technique called chain-of-thought—improved their performance on a range of math and logic questions. Even weirder, Battle found that giving a model positive prompts, such as “this will be fun” or “you are as smart as chatGPT,” sometimes improved performance.
Battle and Gollapudi decided to systematically test how different prompt-engineering strategies impact an LLM’s ability to solve grade-school math questions. They tested three different open-source language models with 60 different prompt combinations each. What they found was a surprising lack of consistency. Even chain-of-thought prompting sometimes helped and other times hurt performance. “The only real trend may be no trend,” they write. “What’s best for any given model, dataset, and prompting strategy is likely to be specific to the particular combination at hand.”
According to one research team, no human should manually optimize prompts ever again.
There is an alternative to the trial-and-error-style prompt engineering that yielded such inconsistent results: Ask the language model to devise its own optimal prompt. Recently, new tools have been developed to automate this process. Given a few examples and a quantitative success metric, these tools will iteratively find the optimal phrase to feed into the LLM. Battle and his collaborators found that in almost every case, this automatically generated prompt did better than the best prompt found through trial-and-error. And, the process was much faster, a couple of hours rather than several days of searching.
The optimal prompts the algorithm spit out were so bizarre, no human is likely to have ever come up with them. “I literally could not believe some of the stuff that it generated,” Battle says. In one instance, the prompt was just an extended Star Trek reference: “Command, we need you to plot a course through this turbulence and locate the source of the anomaly. Use all available data and your expertise to guide us through this challenging situation.” Apparently, thinking it was Captain Kirk helped this particular LLM do better on grade-school math questions.
Battle says that optimizing the prompts algorithmically fundamentally makes sense given what language models really are—models. “A lot of people anthropomorphize these things because they ‘speak English.’ No, they don’t,” Battle says. “It doesn’t speak English. It does a lot of math.”
In fact, in light of his team’s results, Battle says no human should manually optimize prompts ever again.
“You’re just sitting there trying to figure out what special magic combination of words will give you the best possible performance for your task,” Battle says, “But that’s where hopefully this research will come in and say ‘don’t bother.’ Just develop a scoring metric so that the system itself can tell whether one prompt is better than another, and then just let the model optimize itself.”
Autotuned prompts make pictures prettier, too
Image-generation algorithms can benefit from automatically generated prompts as well. Recently, a team at Intel labs, led by Vasudev Lal, set out on a similar quest to optimize prompts for the image-generation model Stable Diffusion. “It seems more like a bug of LLMs and diffusion models, not a feature, that you have to do this expert prompt engineering,” Lal says. “So, we wanted to see if we can automate this kind of prompt engineering.”
“Now we have this full machinery, the full loop that’s completed with this reinforcement learning.… This is why we are able to outperform human prompt engineering.” —Vasudev Lal, Intel Labs
Lal’s team created a tool called NeuroPrompts that takes a simple input prompt, such as “boy on a horse,” and automatically enhances it to produce a better picture. To do this, they started with a range of prompts generated by human prompt-engineering experts. They then trained a language model to transform simple prompts into these expert-level prompts. On top of that, they used reinforcement learning to optimize these prompts to create more aesthetically pleasing images, as rated by yet another machine-learning model, PickScore, a recently developed image-evaluation tool.
NeuroPrompts is a generative AI auto prompt tuner that transforms simple prompts into more detailed and visually stunning StableDiffusion results—as in this case, an image generated by a generic prompt [left] versus its equivalent NeuroPrompt-generated image.Intel Labs/Stable Diffusion
Here too, the automatically generated prompts did better than the expert-human prompts they used as a starting point, at least according to the PickScore metric. Lal found this unsurprising. “Humans will only do it with trial and error,” Lal says. “But now we have this full machinery, the full loop that’s completed with this reinforcement learning.… This is why we are able to outperform human prompt engineering.”
Since aesthetic quality is infamously subjective, Lal and his team wanted to give the user some control over how the prompt was optimized. In their tool, the user can specify the original prompt (say, “boy on a horse”) as well as an artist to emulate, a style, a format, and other modifiers.
Lal believes that as generative AI models evolve, be it image generators or large language models, the weird quirks of prompt dependence should go away. “I think it’s important that these kinds of optimizations are investigated and then ultimately, they’re really incorporated into the base model itself so that you don’t really need a complicated prompt-engineering step.”
Prompt engineering will live on, by some name
Even if autotuning prompts becomes the industry norm, prompt-engineering jobs in some form are not going away, says Tim Cramer, senior vice president of software engineering at Red Hat. Adapting generative AI for industry needs is a complicated, multistage endeavor that will continue requiring humans in the loop for the foreseeable future.
“Maybe we’re calling them prompt engineers today. But I think the nature of that interaction will just keep on changing as AI models also keep changing.” —Vasudev Lal, Intel Labs
“I think there are going to be prompt engineers for quite some time, and data scientists,” Cramer says. “It’s not just asking questions of the LLM and making sure that the answer looks good. But there’s a raft of things that prompt engineers really need to be able to do.”
“It’s very easy to make a prototype,” Henley says. “It’s very hard to production-ize it.” Prompt engineering seems like a big piece of the puzzle when you’re building a prototype, Henley says, but many other considerations come into play when you’re making a commercial-grade product.
Challenges of making a commercial product include ensuring reliability—for example, failing gracefully when the model goes offline; adapting the model’s output to the appropriate format, since many use cases require outputs other than text; testing to make sure the AI assistant won’t do something harmful in even a small number of cases; and ensuring safety, privacy, and compliance. Testing and compliance are particularly difficult, Henley says, as traditional software-development testing strategies are maladapted for nondeterministic LLMs.
To fulfill these myriad tasks, many large companies are heralding a new job title: Large Language Model Operations, or LLMOps, which includes prompt engineering in its life cycle but also entails all the other tasks needed to deploy the product. Henley says LLMOps’ predecessors, machine learning operations (MLOps) engineers, are best positioned to take on these jobs.
Whether the job titles will be “prompt engineer,” “LLMOps engineer,” or something new entirely, the nature of the job will continue evolving quickly. “Maybe we’re calling them prompt engineers today,” Lal says, “But I think the nature of that interaction will just keep on changing as AI models also keep changing.”
“I don’t know if we’re going to combine it with another sort of job category or job role,” Cramer says, “But I don’t think that these things are going to be going away anytime soon. And the landscape is just too crazy right now. Everything’s changing so much. We’re not going to figure it all out in a few months.”
Henley says that, to some extent in this early phase of the field, the only overriding rule seems to be the absence of rules. “It’s kind of the Wild, Wild West for this right now.” he says.
Enlarge (credit: Google)
On Wednesday, Google announced a new family of AI language models called Gemma, which are free, open-weights models built on technology similar to the more powerful but closed Gemini models. Unlike Gemini, Gemma models can run locally on a desktop or laptop computer. It's Google's first significant open large language model (LLM) release since OpenAI's ChatGPT started a frenzy for AI chatbots in 2022.
Gemma models come in two sizes: Gemma 2B (2 billi
On Wednesday, Google announced a new family of AI language models called Gemma, which are free, open-weights models built on technology similar to the more powerful but closed Gemini models. Unlike Gemini, Gemma models can run locally on a desktop or laptop computer. It's Google's first significant open large language model (LLM) release since OpenAI's ChatGPT started a frenzy for AI chatbots in 2022.
Gemma models come in two sizes: Gemma 2B (2 billion parameters) and Gemma 7B (7 billion parameters), each available in pre-trained and instruction-tuned variants. In AI, parameters are values in a neural network that determine AI model behavior, and weights are a subset of these parameters stored in a file.
Developed by Google DeepMind and other Google AI teams, Gemma pulls from techniques learned during the development of Gemini, which is the family name for Google's most capable (public-facing) commercial LLMs, including the ones that power its Gemini AI assistant. Google says the name comes from the Latin gemma, which means "precious stone."
A technical paper titled “Low Power and Temperature-Resilient Compute-In-Memory Based on Subthreshold-FeFET” was published by researchers at Zhejiang University, University of Notre Dame, Technical University of Munich, Munich Institute of Robotics and Machine Intelligence, and the Laboratory of Collaborative Sensing and Autonomous Unmanned Systems of Zhejiang Province.
Abstract:
“Compute-in-memory (CiM) is a promising solution for addressing the challenges of artificial intelligence (AI) and th
A technical paper titled “Low Power and Temperature-Resilient Compute-In-Memory Based on Subthreshold-FeFET” was published by researchers at Zhejiang University, University of Notre Dame, Technical University of Munich, Munich Institute of Robotics and Machine Intelligence, and the Laboratory of Collaborative Sensing and Autonomous Unmanned Systems of Zhejiang Province.
Abstract:
“Compute-in-memory (CiM) is a promising solution for addressing the challenges of artificial intelligence (AI) and the Internet of Things (IoT) hardware such as ‘memory wall’ issue. Specifically, CiM employing nonvolatile memory (NVM) devices in a crossbar structure can efficiently accelerate multiply-accumulation (MAC) computation, a crucial operator in neural networks among various AI models. Low power CiM designs are thus highly desired for further energy efficiency optimization on AI models. Ferroelectric FET (FeFET), an emerging device, is attractive for building ultra-low power CiM array due to CMOS compatibility, high ION /IOF ratio, etc. Recent studies have explored FeFET based CiM designs that achieve low power consumption. Nevertheless, subthreshold-operated FeFETs, where the operating voltages are scaled down to the subthreshold region to reduce array power consumption, are particularly vulnerable to temperature drift, leading to accuracy degradation. To address this challenge, we propose a temperature-resilient 2T-1FeFET CiM design that performs MAC operations reliably at subthreahold region from 0 to 85 Celsius, while consuming ultra-low power. Benchmarked against the VGG neural network architecture running the CIFAR-10 dataset, the proposed 2T-1FeFET CiM design achieves 89.45% CIFAR-10 test accuracy. Compared to previous FeFET based CiM designs, it exhibits immunity to temperature drift at an 8-bit wordlength scale, and achieves better energy efficiency with 2866 TOPS/W.”
Zhou, Yifei, Xuchu Huang, Jianyi Yang, Kai Ni, Hussam Amrouch, Cheng Zhuo, and Xunxhao Yin. “Low Power and Temperature-Resilient Compute-In-Memory Based on Subthreshold-FeFET.” arXiv preprint arXiv:2312.17442 (2023).