IEEE Spectrum
1-bit LLMs Could Solve AI’s Energy DemandsMatthew Hutson
Large language models, the AI systems that power chatbots like ChatGPT, are getting better and better—but they’re also getting bigger and bigger, demanding more energy and computational power. For LLMs that are cheap, fast, and environmentally friendly, they’ll need to shrink, ideally small enough to run directly on devices like cellphones. Researchers are finding ways to do just that by drastically rounding off the many high-precision numbers that store their memories to equal just 1 or -1.LLMs
30. Květen 2024 v 20:28

1-bit LLMs Could Solve AI’s Energy Demands

Od: Matthew Hutson

30. Květen 2024 v 20:28

Large language models, the AI systems that power chatbots like ChatGPT, are getting better and better—but they’re also getting bigger and bigger, demanding more energy and computational power. For LLMs that are cheap, fast, and environmentally friendly, they’ll need to shrink, ideally small enough to run directly on devices like cellphones. Researchers are finding ways to do just that by drastically rounding off the many high-precision numbers that store their memories to equal just 1 or -1.

LLMs, like all neural networks, are trained by altering the strengths of connections between their artificial neurons. These strengths are stored as mathematical parameters. Researchers have long compressed networks by reducing the precision of these parameters—a process called quantization—so that instead of taking up 16 bits each, they might take up 8 or 4. Now researchers are pushing the envelope to a single bit.

How to Make a 1-bit LLM

There are two general approaches. One approach, called post-training quantization (PTQ) is to quantize the parameters of a full-precision network. The other approach, quantization-aware training (QAT), is to train a network from scratch to have low-precision parameters. So far, PTQ has been more popular with researchers.

In February, a team including Haotong Qin at ETH Zurich, Xianglong Liu at Beihang University, and Wei Huang at the University of Hong Kong introduced a PTQ method called BiLLM. It approximates most parameters in a network using 1 bit, but represents a few salient weights—those most influential to performance—using 2 bits. In one test, the team binarized a version of Meta’s LLaMa LLM that has 13 billion parameters.

“One-bit LLMs open new doors for designing custom hardware and systems specifically optimized for 1-bit LLMs.” —Furu Wei, Microsoft Research Asia

To score performance, the researchers used a metric called perplexity, which is basically a measure of how surprised the trained model was by each ensuing piece of text. For one dataset, the original model had a perplexity of around 5, and the BiLLM version scored around 15, much better than the closest binarization competitor, which scored around 37 (for perplexity, lower numbers are better). That said, the BiLLM model required about a tenth of the memory capacity as the original.

PTQ has several advantages over QAT, says Wanxiang Che, a computer scientist at Harbin Institute of Technology, in China. It doesn’t require collecting training data, it doesn’t require training a model from scratch, and the training process is more stable. QAT, on the other hand, has the potential to make models more accurate, since quantization is built into the model from the beginning.

1-bit LLMs Find Success Against Their Larger Cousins

Last year, a team led by Furu Wei and Shuming Ma, at Microsoft Research Asia, in Beijing, created BitNet, the first 1-bit QAT method for LLMs. After fiddling with the rate at which the network adjusts its parameters, in order to stabilize training, they created LLMs that performed better than those created using PTQ methods. They were still not as good as full-precision networks, but roughly 10 times as energy efficient.

In February, Wei’s team announced BitNet 1.58b, in which parameters can equal -1, 0, or 1, which means they take up roughly 1.58 bits of memory per parameter. A BitNet model with 3 billion parameters performed just as well on various language tasks as a full-precision LLaMA model with the same number of parameters and amount of training, but it was 2.71 times as fast, used 72 percent less GPU memory, and used 94 percent less GPU energy. Wei called this an “aha moment.” Further, the researchers found that as they trained larger models, efficiency advantages improved.

A BitNet model with 3 billion parameters performed just as well on various language tasks as a full-precision LLaMA model.

This year, a team led by Che, of Harbin Institute of Technology, released a preprint on another LLM binarization method, called OneBit. OneBit combines elements of both PTQ and QAT. It uses a full-precision pretrained LLM to generate data for training a quantized version. The team’s 13-billion-parameter model achieved a perplexity score of around 9 on one dataset, versus 5 for a LLaMA model with 13 billion parameters. Meanwhile, OneBit occupied only 10 percent as much memory. On customized chips, it could presumably run much faster.

Wei, of Microsoft, says quantized models have multiple advantages. They can fit on smaller chips, they require less data transfer between memory and processors, and they allow for faster processing. Current hardware can’t take full advantage of these models, though. LLMs often run on GPUs like those made by Nvidia, which represent weights using higher precision and spend most of their energy multiplying them. New hardware could natively represent each parameter as a -1 or 1 (or 0), and then simply add and subtract values and avoid multiplication. “One-bit LLMs open new doors for designing custom hardware and systems specifically optimized for 1-bit LLMs,” Wei says.

“They should grow up together,” Huang, of the University of Hong Kong, says of 1-bit models and processors. “But it’s a long way to develop new hardware.”

Ars Technica - All content
Google goes “open AI” with Gemma, a free, open-weights chatbot familyBenj Edwards
Enlarge (credit: Google) On Wednesday, Google announced a new family of AI language models called Gemma, which are free, open-weights models built on technology similar to the more powerful but closed Gemini models. Unlike Gemini, Gemma models can run locally on a desktop or laptop computer. It's Google's first significant open large language model (LLM) release since OpenAI's ChatGPT started a frenzy for AI chatbots in 2022. Gemma models come in two sizes: Gemma 2B (2 billi
21. Únor 2024 v 23:01

Google goes “open AI” with Gemma, a free, open-weights chatbot family

Ars Technica - All content

Od: Benj Edwards

21. Únor 2024 v 23:01

Enlarge (credit: Google)

On Wednesday, Google announced a new family of AI language models called Gemma, which are free, open-weights models built on technology similar to the more powerful but closed Gemini models. Unlike Gemini, Gemma models can run locally on a desktop or laptop computer. It's Google's first significant open large language model (LLM) release since OpenAI's ChatGPT started a frenzy for AI chatbots in 2022.

Gemma models come in two sizes: Gemma 2B (2 billion parameters) and Gemma 7B (7 billion parameters), each available in pre-trained and instruction-tuned variants. In AI, parameters are values in a neural network that determine AI model behavior, and weights are a subset of these parameters stored in a file.

Developed by Google DeepMind and other Google AI teams, Gemma pulls from techniques learned during the development of Gemini, which is the family name for Google's most capable (public-facing) commercial LLMs, including the ones that power its Gemini AI assistant. Google says the name comes from the Latin gemma, which means "precious stone."

Read 5 remaining paragraphs | Comments

I, Cringely
AI and Moore’s Law: It’s the Chips, StupidRobert X. Cringely
Sorry I’ve been away: time flies when you are not having fun. But now I’m back. Moore’s Law, which began with a random observation by the late Intel co-founder Gordon Moore that transistor densities on silicon substrates were doubling every 18 months, has over the intervening 60+ years been both borne-out yet also changed from a lithography technical feature to an economic law. It’s getting harder to etch ever-thinner lines, so we’ve taken as a culture to emphasizing the cost part of Moore’s Law
15. Červen 2023 v 16:20

AI and Moore’s Law: It’s the Chips, Stupid

I, Cringely

Od: Robert X. Cringely

15. Červen 2023 v 16:20

Sorry I’ve been away: time flies when you are not having fun. But now I’m back.

Moore’s Law, which began with a random observation by the late Intel co-founder Gordon Moore that transistor densities on silicon substrates were doubling every 18 months, has over the intervening 60+ years been both borne-out yet also changed from a lithography technical feature to an economic law. It’s getting harder to etch ever-thinner lines, so we’ve taken as a culture to emphasizing the cost part of Moore’s Law (chips drop in price by 50 percent on an area basis (dollars per acre of silicon) every 18 months). We can accomplish this economic effect through a variety of techniques including multiple cores, System-On-Chip design, and unified memory — anything to keep prices going-down.

I predict that Generative Artificial Intelligence is going to go a long way toward keeping Moore’s Law in force and the way this is going to happen says a lot about the chip business, global economics, and Artificial Intelligence, itself.

Let’s take these points in reverse order. First, Generative AI products like ChatGPT are astoundingly expensive to build. GPT-4 reportedly cost $100+ million to build, mainly in cloud computing resources. Yes, this was primarily Microsoft paying itself and so maybe the economics are a bit suspect, but the actual calculations took tens of thousands of GPUs running for months and that can’t be denied. Nor can it be denied that building GPT-5 will cost even more.

Some people think this economic argument is wrong, that Large Language Models comparable to ChatGPT can be built using Open Source software for only a few hundred or a few thousand dollars. Yes and no.

Competitive-yet-inexpensive LLMs built at such low cost have nearly all started with Meta’s (Facebook’s) LLaMA (Large Language Model Meta AI), which has effectively become Open Source now that both the code and the associated parameter weights — a big deal in fine-tuning language models — have been released to the wild. It’s not clear how much of this Meta actually intended to do, but this genie is out of its bottle to great effect in the AI research community.

But GPT-5 will still cost $1+ billion and even ChatGPT, itself, is costing about $1 million per day just to run. That’s $300+ million per year to run old code.

So the current el cheapo AI research frenzy is likely to subside as LLaMA ages into obsolescence and has to be replaced by something more expensive, putting Google, Microsoft and OpenAI back in control. Understand, too, that these big, established companies like the idea of LLMs costing so much to build because that makes it harder for startups to disrupt. It’s a form of restraint of trade, though not illegal.

But before then — and even after then in certain vertical markets — there is a lot to learn and a lot of business to be done using these smaller models, which can be used to build true professional language models, which GPT-4 and ChatGPT definitely are not.

GPT-4 and ChatGPT are general purpose models — supposedly useful for pretty much anything. But that means that when you are asking ChatGPT for legal advice, for example, you are asking it to imitate a lawyer. While ChatGPT may be able to pass the bar test, so did my cousin Chad, whom I assure you is an idiot.

If you are reading this I’ll bet you are smarter than your lawyer.

This means there is an opportunity for vertical LLMs trained on different data — real data from industries like medicine and auto mechanics. Whoever owns this data will own these markets.

What will make these models both better and cheaper is they can be built from a LLaMA base because most of that data doesn’t have to change over time to still fix your car, and the added Machine Learning won’t be from crap found on the Internet, but rather from the service manuals actually used to train mechanics and fix cars.

We are approaching a time when LLMs won’t have to imitate mechanics and nurses because they will be trained like mechanics and nurses.

Bloomberg has already done this for investment advice using its unique database of historical financial information.

With an average of 50 billion nodes, these vertical models will cost only five percent as much to run as OpenAI’s one billion node GPT-4.

But what does this have to do with semiconductors and Moore’s Law? Chip design is very similar to fixing cars in that there is a very limited amount of Machine Learning data required (think of logic cells as language words). It’s a small vocabulary (the auto repair section at the public library is just a few shelves of books). And EVEN BETTER THAN AUTO REPAIR, the semiconductor industry has well-developed simulation tools for testing logic before it is actually built.

So it ought to be pretty simple to apply AI to chip design, building custom chip design models to iterate into existing simulators and refine new designs that actually have a pretty good chance of being novel.

And who will be the first to leverage this chip AI? China.

The USA is doing its best to freeze China out of semiconductor development, denying access to advanced manufacturing tools, for example. But China is arguably the world’s #2 country for AI research and can use that advantage to make up some of the difference.

Look for fabless AI chip startups to spring-up around Chinese universities and for the Chinese Communist Party to put lots of money into this very cost-effective work. Because even if it’s used just to slim-down and improve existing designs, that’s another generation of chips China might otherwise not have had at all.

The post AI and Moore’s Law: It’s the Chips, Stupid first appeared on I, Cringely.

Digital Branding
Web Design Marketing

Normální zobrazení

How to Make a 1-bit LLM

1-bit LLMs Find Success Against Their Larger Cousins