
LPDDR Memory Is Key For On-Device AI Performance

By Nidish Kamath | June 24, 2024 at 09:02

Low-Power Double Data Rate (LPDDR) emerged as a specialized high-performance, low-power memory for mobile phones. Since its first release in 2006, each new generation of LPDDR has delivered the bandwidth and capacity needed for major shifts in the mobile user experience. Once again, LPDDR is at the forefront of another key shift as the next wave of generative AI applications will be built into our mobile phones and laptops.

AI on endpoints is all about efficient inference. The process of employing trained AI models to make predictions or decisions requires specialized memory technologies with greater performance that are tailored to the unique demands of endpoint devices. Memory for AI inference on endpoints requires getting the right balance between bandwidth, capacity, power and compactness of form factor.

LPDDR evolved from DDR memory technology as a power-efficient alternative; LPDDR5, and the optional extension LPDDR5X, are the most recent updates to the standard. LPDDR5X is focused on improving performance, power, and flexibility; it offers data rates up to 8.533 Gbps, significantly boosting speed and performance. Compared to DDR5 memory, LPDDR5/5X limits the data bus width to 32 bits, while increasing the data rate. The switch to a quarter-speed clock, as compared to a half-speed clock in LPDDR4, along with a new feature – Dynamic Voltage Frequency Scaling – keeps the higher data rate LPDDR5 operation within the same thermal budget as LPDDR4-based devices.

Given the space constraints of mobile devices, combined with the greater memory needs of advanced applications, LPDDR5X can support capacities of up to 64GB by using multiple DRAM dies in a multi-die package. Consider the example of a 7B LLaMa 2 model: the model consumes 3.5GB of memory capacity when quantized to INT4. An x64 LPDDR5X package, with two LPDDR5X devices per package, provides an aggregate bandwidth of 68 GB/s, so the LLaMa 2 model can run inference at 19 tokens per second.

As demand for more memory performance grows, LPDDR5 continues to evolve in the market, with the major vendors announcing an additional extension to LPDDR5 known as LPDDR5T, with the T standing for turbo. LPDDR5T boosts performance to 9.6 Gbps, enabling an aggregate bandwidth of 76.8 GB/s in an x64 package of multiple LPDDR5T stacks. The 7B LLaMa 2 model from the example above can then run inference at 21 tokens per second.
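
As a rough sanity check on the figures above, the token rate of a memory-bound LLM can be estimated by dividing aggregate memory bandwidth by the bytes that must be streamed per generated token. A minimal sketch, assuming (as a simplification) that each token reads the full 3.5GB INT4 weight set from DRAM exactly once:

```python
# Back-of-envelope token rate for a memory-bound LLM on LPDDR5X/LPDDR5T.
# Assumption (simplification): each generated token streams the full
# INT4 weight set (3.5 GB for a 7B-parameter model) from DRAM once.

def aggregate_bandwidth_gb_s(data_rate_gbps: float, bus_width_bits: int = 64) -> float:
    """Aggregate bandwidth of an x64 package in GB/s."""
    return data_rate_gbps * bus_width_bits / 8

def tokens_per_second(bandwidth_gb_s: float, weights_gb: float = 3.5) -> float:
    """Tokens/s if inference is purely memory-bandwidth bound."""
    return bandwidth_gb_s / weights_gb

for name, rate in [("LPDDR5X", 8.533), ("LPDDR5T", 9.6)]:
    bw = aggregate_bandwidth_gb_s(rate)
    print(f"{name}: {bw:.1f} GB/s -> ~{tokens_per_second(bw):.1f} tokens/s")
# LPDDR5X: 68.3 GB/s -> ~19.5 tokens/s (the article rounds down to 19)
# LPDDR5T: 76.8 GB/s -> ~21.9 tokens/s (the article rounds down to 21)
```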

With its low power consumption and high bandwidth capabilities, LPDDR5 is a great choice of memory not just for cutting-edge mobile devices, but also for AI inference on endpoints where power efficiency and compact form factor are crucial considerations. Rambus offers a new LPDDR5T/5X/5 Controller IP that is fully optimized for use in applications requiring high memory throughput and low latency. The Rambus LPDDR5T/5X/5 Controller enables cutting-edge LPDDR5T memory devices and supports all third-party LPDDR5 PHYs. It maximizes bus bandwidth and minimizes latency via look-ahead command processing, bank management and auto-precharge. The controller can be delivered with additional cores such as the In-line ECC or Memory Analyzer cores to improve in-field reliability, availability and serviceability (RAS).

The post LPDDR Memory Is Key For On-Device AI Performance appeared first on Semiconductor Engineering.


Powering Next-Generation Insightful Design

By Sara Louie | June 24, 2024 at 09:01

The Ansys team is gearing up for an exciting time at DAC this week, where we’ll be sharing a whole new way of visualizing physical phenomena in 3D-IC designs, powered by NVIDIA Omniverse, a platform for developing OpenUSD and RTX-enabled 3D applications and workflows. Please attend our Exhibitor Forum session so we can show you the valuable design insights you can gain by interactively viewing surface currents, temperatures and mechanical deformations in a representative 3D-IC design.

Visualizing physical phenomena in 3D is a new paradigm for IC packaging signal and power integrity (SI/PI) engineers who are more familiar with schematics and 2D results plots (TDR, eye diagrams, S/Y/Z-parameters, etc.). There's a good reason for that – it really hasn't been practical to save 3D physics data – or even to run full 3D simulations – for complex IC package designs until advancements over the past 5-10 years in Ansys solver technologies, along with increasing access to high performance compute power. I have had the pleasure of supporting a few early adopters in the SI/PI engineering community as they used 3D field plotting in HFSS to gain design insights that helped them avoid costly tape-out delays and chip re-spins. Their experiences motivate my desire to share this invaluable capability with the greater IC packaging design community.

I started my career using HFSS to design antennas for biomedical applications. Like me, the greater antenna and RF component design community has been plotting fields in 3D from Day 1 – I don’t know of any antenna engineer that hasn’t plotted a 3D radiation pattern after running an HFSS simulation. What I do know is that I’ve met hundreds of SI/PI engineers who have never plotted surface currents in their package or PCB models after running an HFSS simulation. And that must change.

…but why exactly? What value does one gain by plotting fields in 3D? If you don’t know what kind of design insights you gain by plotting fields, please allow me to show you because seeing is revealing.

Let’s say I send a signal from point A and it reflects off 3 different plates before returning back to point A:

Fig. 1: The plot below shows the received power from a sensor placed at Point A (the same position as the emitter). Can you tell which bump in this 2D plot is the original signal returned back home?

Fig. 2: No? Neither can I. Even if you somehow guessed correctly, could you explain what caused all those other bumps with certainty?

With HFSS field plotting, everything becomes crystal clear:

Video 1: 3-bounce animation.

The original signal returns after a travel distance of 6 meters around the circle while the rest of the bumps result from other reflections – this is the kind of physical insight that a 2D plot simply can’t deliver.
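
The mapping from bump position to path length is simple time-of-flight arithmetic: delay = round-trip distance / c. A minimal sketch, using the 6-meter path from the example plus two hypothetical longer multi-bounce paths (invented only to illustrate why the arrivals separate in time):

```python
# Map round-trip path length to arrival delay (free-space propagation).
# The 6 m path is the article's example; the longer paths are hypothetical
# multi-bounce routes added only to show the spread of arrival times.
C = 299_792_458.0  # speed of light, m/s

paths_m = {
    "3-bounce loop (article example)": 6.0,
    "hypothetical longer bounce path": 9.0,
    "hypothetical double loop": 12.0,
}

for name, distance in paths_m.items():
    print(f"{name}: {distance:.1f} m -> {distance / C * 1e9:.1f} ns")
# 3-bounce loop (article example): 6.0 m -> 20.0 ns
```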

Advanced packaging for 3D-IC design can't be done with generic rules of thumb. To meet stringent specifications (operation at higher frequencies and speeds, with lower latency and power consumption), engineers are demanding the use of HFSS because its gold-standard reputation comes from countless validation studies comparing simulated results against measurement, and there is no room to compromise on accuracy for these very complex and costly designs. As more electronics are packed into tighter spaces, the risk of unwanted coupling between the different components (often stacked vertically) increases, and identifying the true aggressors becomes more challenging. That's when field visualization as a means to debug – i.e., uncover, learn, truly understand – what's happening becomes an invaluable tool.

What exactly is causing unwanted radiation or reflections? What exactly is the source of noise coupling into this line? Are we seeing any current crowding on conductors or significant volume losses in dielectrics that may lead to thermal problems? Plot the fields and you will have the information required to diagnose issues and design the exact right solution – nothing more (over-designed), nothing less (failing design).

If you’re one of the early adopters who has plotted fields in highly complex designs, you will know that fields post-processing can be very graphics intensive – especially if you want to immerse yourself in the physical phenomena taking place in your design by plotting in a full 3D volume and cutting through the 3D space layer by layer. Ansys uses the enhanced graphics and visual rendering capabilities offered by NVIDIA Omniverse core technologies, available as APIs, to provide a seamlessly interactive and more intuitive experience, increasing accessibility to design insights that engineers can only gain by physics visualization.

I’m not asking my SI/PI engineering colleagues to ditch the schematics and 2D results plots – I just believe that they should (and inevitably will!) add 3D field plotting into their design process. The rise in design complexity coupled with advancements in Ansys and NVIDIA technologies is poised to make 3D field plotting – an invaluable yet heretofore underutilized tool in the world of signal and power integrity design – a practical requirement. For a first look into what will one day be commonplace design practice, please attend our Exhibitor Forum session on Wed., June 26, 1:45-3:00 PM at DAC to experience our interactive demonstration of Ansys physics visualization in a representative 3D-IC design, powered by NVIDIA Omniverse.

The post Powering Next-Generation Insightful Design appeared first on Semiconductor Engineering.


Margin Sensors In The Wild

By Barry Pangrle | June 10, 2024 at 09:01

Back in March, I wrote up an article here that looked at how a proxy circuit could be used to measure variations in circuit performance as conditions changed in the operating environment. There were a couple of recent presentations on margin sensors at two of the big EDA vendors' customer engineering forums that we'll look at, as well as another product with an upcoming presentation at DAC. Margin sensors have applications in silicon health and performance monitoring for SoCs, characterization, yield, reliability, safety, power, and performance. How they are configured, though, determines the tasks they are best suited for.

The first presentation was given at Synopsys’ SNUG Silicon Valley on March 20, 2024, titled “Diagnosis of Timing Margin on Silicon with PMM (Path Margin Monitor)”, by Gurnrack Moon, Principal Engineer at Samsung. One of the key aspects of the PMM that Samsung appreciated was the closer correlation between the PMM and the actual paths versus, say, using a Ring Oscillator approach.

Fig. 1: Synopsys Path Margin Monitor diagram. (Source: Synopsys)

My previous article described how the "Monitor Logic" portion of the PMM diagram shown above in figure 1 would conceptually work. Taps taken along the synthetic circuit of buffers can be compared to see how far the signal made it down the path and thus determine how much margin is available. A strength of this approach is that it allows one PMM to be used on multiple paths. It does have the disadvantage, though, of introducing additional control overhead and adding extra delay components into the monitor path.

The PMMs on the chip are connected in a daisy-chain fashion, which reduces the number of signals needed to send information from the PMMs to the Path Margin Monitor Controller. This setup efficiently uses chip area to provide information about the state of the silicon. Typically, one might expect this type of capability to be exercised in a "diagnostic" mode, where data would be captured, analyzed, and then used to determine appropriate voltage and frequency settings, as opposed to a more dynamic or adaptive approach.

Samsung appreciated being able to “determine if there are problems or what is different from what is designed, and what needs to be improved. In addition, PMM data fed to the Synopsys Silicon.da analytics platform provides rich analytics, shortening the debug/analysis time.” This was used on production silicon. Synopsys also has other blog articles here and here for the interested reader.

The second presentation was given at CadenceLIVE Silicon Valley, April 17, 2024, titled “Challenges in Datacenters: Search for Advanced Power Management Mechanisms”, and presented by Ziv Paz, Vice President of Business Development at proteanTecs. His presentation focused on proteanTecs’ Margin Agents and noted how these sensors were sensitive to process, aging, workload stress, latent defects, operating conditions, DC IR drops, and local Vdroops.

Fig. 2: Reducing voltage while staying within margin. (Source: proteanTecs, CadenceLIVE)

Figure 2 shows how designers must handle "worst-case" scenarios and often do so by building in enough margin to operate under those conditions. In the diagram shown here, that margin shows up as a higher operating VDD. If the normal operating point is 650mV with an allowance for a -10% change in VDD, then the design is implemented to operate correctly down to 585mV (90% * 650mV). Most of the time, though, the circuitry will operate properly below 650mV, so running at 650mV is just wasting energy.

proteanTecs then presented a case study that was designed using TSMC’s 5nm technology. The chip incorporated 448 margin agents consisting of buffers with a unit delay of 7ps.

Fig. 3: Example margin agents and corresponding voltage. (Source: proteanTecs, CadenceLIVE)

Figure 3 above shows all 448 margin agents on the left side, with the thicker black line tracing the worst case across them; the right side shows the voltage. It also demonstrates that when the threshold is lowered, the voltage drops to 614mV and the design continues to operate properly.

Fig. 4: Example margin agents with droop and corresponding voltage. (Source: proteanTecs, CadenceLIVE)

Figure 4 shows that as the voltage on the right drops, the worst-case margin agent values also drop; once they cross the yellow(-ish) line, the voltage is signaled to return to the pre-AVS value of 650mV. The margin agent values then improve and the AVS voltage of 614mV kicks back in. By reacting when the margin agents cross the yellow line, the scheme allows time for the voltage to increase and settle before it ever hits the red (585mV) line, thus always keeping the design in the proper operating zone.
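
A minimal sketch of that control behavior, using the voltages from the proteanTecs figures (650mV nominal, 614mV AVS setpoint, 585mV hard floor); the margin readings and the warning threshold below are hypothetical values chosen only to illustrate the feedback loop, not proteanTecs' implementation:

```python
# Illustrative adaptive voltage scaling (AVS) loop driven by a margin sensor.
# Voltages come from the figures in the article (650/614 mV, 585 mV floor);
# the margin readings and threshold are hypothetical, for illustration only.
V_NOMINAL_MV = 650   # pre-AVS operating voltage
V_AVS_MV     = 614   # reduced voltage while margin is healthy
MARGIN_WARN  = 3     # hypothetical warning threshold (e.g., buffer-delay units)

def next_voltage(worst_case_margin: int) -> int:
    """Step the supply based on the worst-case margin agent reading."""
    if worst_case_margin <= MARGIN_WARN:
        return V_NOMINAL_MV   # margin eroded (e.g., a droop): back to 650 mV
    return V_AVS_MV           # margin healthy: run at the AVS setpoint

# Example trace: margin dips during a workload-induced droop, then recovers.
margins = [8, 7, 6, 3, 2, 5, 8]
for m in margins:
    print(f"worst-case margin = {m} -> VDD = {next_voltage(m)} mV")
```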

For this case, proteanTecs saw a 10.77% power saving and said that they’ve typically seen savings in the 9%-14% range. For this data center-oriented customer, this was important because of a limited power budget per rack, cooling limitations, carbon neutrality requirements (PUE), and a high CAPEX. Other benefits are a higher MTTF, lower maintenance costs, and a prolonged system lifetime. proteanTecs claimed a minimal impact on area and that currently most of their designs are in 7nm, 5nm, and below.

The third vendor, Movellus, announced its Aeonic Insight product line, including a droop detector, on November 14, 2023. Michael Durr, Movellus' Director of Application Engineering, is scheduled to give a talk at DAC on Wednesday, June 26, 2024, titled "Droop! There it is!" Movellus has long been known for its digital clock generation IP and, as one might guess, its design uses a synthetic circuit for detecting changes in the operating environment. Leveraging that clock generation expertise, the company is initially targeting an adaptive frequency (or clock) scaling (AFS) approach.

The post Margin Sensors In The Wild appeared first on Semiconductor Engineering.


High-Level Synthesis Propels Next-Gen AI Accelerators

By Russell Klein | May 20, 2024 at 09:01

Everything around you is getting smarter. Artificial intelligence is not just a data center application but will be deployed in all kinds of embedded systems that we interact with daily. We expect to talk to and gesture at them. We expect them to recognize and understand us. And we expect them to operate with just a little bit of common sense. This intelligence is making these systems not just more functional and easier to use, but safer and more secure as well.

All this intelligence comes from advances in deep neural networks. One of the key challenges of neural networks is their computational complexity. Small neural networks can take millions of multiply-accumulate operations (MACs) to produce a result. Larger ones can take billions. Large language models, and similarly complex networks, can take trillions. This level of computation is beyond what can be delivered by embedded processors.
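
To make "millions to billions of MACs" concrete, the MAC count of a single convolution layer is output_height x output_width x output_channels x (kernel_height x kernel_width x input_channels), and a full network sums that over every layer. A small sketch with illustrative ResNet-style dimensions (not taken from the article):

```python
# MACs for one convolution layer: every output element needs
# kernel_h * kernel_w * in_channels multiply-accumulates.
def conv_macs(h_out: int, w_out: int, c_out: int, k: int, c_in: int) -> int:
    return h_out * w_out * c_out * (k * k * c_in)

# Illustrative layer: 56x56 output, 64 output channels, 3x3 kernel, 64 input channels.
macs = conv_macs(56, 56, 64, 3, 64)
print(f"{macs / 1e6:.0f} million MACs for a single layer")  # ~116 million
```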

In some cases, the computation of these inferences can be off-loaded over a network to a data center. Increasingly, devices have fast and reliable network connections – making this a viable option for many systems. However, there are also a lot of systems that have hard real time requirements that cannot be met by even the fastest and most reliable networks. For example, any system that has autonomous mobility – self-driving cars or self-piloted drones – needs to make decisions faster than could be done through an off-site data center. There are also systems where sensitive data is being processed that should not be sent over networks. And anything that goes over a network introduces an additional attack surface for hackers. For all of these reasons – performance, privacy, and security – some inferencing will need to be done on embedded systems.

For very simple networks, embedded CPUs can handle the task. Even a Raspberry Pi can deploy a simple object recognition algorithm. For more complex tasks there are embedded GPUs, as well as neural processing units (NPUs) targeted at embedded systems that can deliver greater computational capability. But for the highest levels of performance and efficiency, building a bespoke AI (Artificial Intelligence) accelerator can enable applications that would otherwise be impractical.

Engineering a new piece of hardware is a daunting undertaking, whether for ASIC or FPGA. But it enables developers to reach a level of performance and efficiency not possible with off-the-shelf components. But how can the average development team build a better machine learning accelerator than the designers creating the most leading-edge commercial AI accelerators, with multiple generations under their belt? By highly customizing the implementation to the specific inference being performed, the implementation can be an order of magnitude better than more generalized solutions.

When a general-purpose AI accelerator developer creates an NPU, their goal is to support any neural network that anyone might conceive. They want to win thousands of design-ins, so they have to make the design as general as possible. Not only that, but they also aim to build some level of "future proofing" into their designs: they want to be able to support any network that might be imagined for several years into the future. Not an easy task in a technology that is evolving so rapidly.

A bespoke accelerator needs to only support the one, or perhaps several, networks to be used. This freedom allows many programmable elements in the implementation of the accelerator to be fixed in hardware. This creates hardware that is both smaller and faster than something general purpose. For example, a dedicated convolution accelerator, with a fixed image and filter size, can be up to 10 times faster than a well-designed general purpose TPU.

General purpose accelerators usually use floating point numbers. This is because virtually all neural networks are developed in Python on general purpose computers using floating point numbers. To ensure correct support of those neural networks, the accelerator must, of course, support floating point numbers. However, most neural networks use numbers close to 0, and require a lot of precision there. And floating-point multipliers are huge. If they are not needed, omitting them from the design saves a lot of area and power.

Some NPUs support integer representation, and sometimes with a variety of sizes. But supporting multiple numeric representation formats adds circuitry, which consumes power and adds propagation delays. Choosing one representation and using that exclusively enables a smaller faster implementation.

When building a bespoke accelerator, one is not limited to 8 bits or 16 bits, any size can be used. Picking the correct numeric representation, or “quantizing” a neural network, allows the data and the operators to be optimally sized. Quantization can significantly reduce the data needed to be stored, moved, and operated on. Reducing the memory footprint for the weight database and shrinking the multipliers can really improve the area and power of a design. For example, a 10-bit fixed-point multiplier is about 20 times smaller than a 32-bit floating-point multiplier, and, correspondingly, will use about 1/20th the power. This means the design can either be much smaller and energy efficient by using the smaller multiplier, or the designer can opt to use the area and deploy 20 multipliers that can operate in parallel, producing much higher performance using the same resources.

One of the key challenges in building a bespoke machine learning accelerator is that the data scientists who created the neural network usually do not understand hardware design, and the hardware designers do not understand data science. In a traditional design flow, they would use “meetings” and “specifications” to transfer knowledge and share ideas. But, honestly, no one likes meetings or specifications. And they are not particularly good at effecting an information exchange.

High-Level Synthesis (HLS) allows an implementation produced by the data scientists to be used, not just as an executable reference, but as a machine-readable input to the hardware design process. This eliminates the manual reinterpretation of the algorithm in the design flow, which is slow and extremely error prone. HLS synthesizes an RTL implementation from an algorithmic description. Usually, the algorithm is described in C++ or SystemC, but a number of design flows like HLS4ML are enabling HLS tools to take neural network descriptions directly from machine learning frameworks.

HLS enables a practical exploration of quantization in a way that is not yet practical in machine learning frameworks. Fully understanding the impact of quantization requires a bit-accurate implementation of the algorithm, including characterization of the effects of overflow, saturation, and rounding. Today this is only practical in hardware description languages (HDLs) or with HLS bit-accurate data types (https://hlslibs.org).
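
A minimal sketch of what a bit-accurate fixed-point type has to model during quantization exploration: values are scaled to an integer grid, rounded, and saturated to the representable range. This is not the HLS library implementation, just an illustration of the overflow, saturation, and rounding effects mentioned above, with an arbitrary 10-bit format:

```python
# Illustrative bit-accurate fixed-point quantization: signed, `total_bits` wide,
# with `frac_bits` fractional bits. Models the rounding and saturation that
# bit-accurate HLS data types make visible.
def quantize_fixed(x: float, total_bits: int = 10, frac_bits: int = 7) -> float:
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))       # most negative representable code
    hi = (1 << (total_bits - 1)) - 1    # most positive representable code
    code = round(x * scale)             # rounding step
    code = max(lo, min(hi, code))       # saturation on overflow
    return code / scale                 # back to a real-valued weight

for w in [0.01234, -0.7, 4.5, -5.0]:
    print(f"{w:+.5f} -> {quantize_fixed(w):+.5f}")
# Small weights keep ~2^-7 resolution; 4.5 and -5.0 saturate at the range limits.
```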

As machine learning becomes ubiquitous, more embedded systems will need to deploy inferencing accelerators. HLS is a practical and proven way to create bespoke accelerators, optimized for a very specific application, that deliver higher performance and efficiency than general purpose NPUs.

For more information on this topic, read the paper: High-Level Synthesis Enables the Next Generation of Edge AI Accelerators.

The post High-Level Synthesis Propels Next-Gen AI Accelerators appeared first on Semiconductor Engineering.


Efficient Electronics

By Andy Heinig | May 16, 2024 at 09:07

Attention nowadays has turned to the energy consumption of systems that run on electricity. At the moment, the discussion is focused on electricity consumption in data centers: if this continues to rise at its current rate, it will account for a significant proportion of global electricity consumption in the future. Yet there are other, less visible electricity consumers whose power needs are also constantly growing. One example is mobile communications, where ongoing expansion – especially with the current 5G standard and the future 6G standard – is pushing up the number of base stations required. This, too, will drive up electricity demand, since demand increases linearly with the number of stations, at least if the demand per base station is not reduced. Another example is the electronics used to manage household appliances and industrial equipment: more and more such systems are being installed, and their electronics are becoming significantly more powerful. They are not currently optimized for power consumption, but rather for performance.

This state of affairs simply cannot continue into the future for two reasons: first, the price of electricity will continue to rise worldwide; and second, many companies are committed to becoming carbon neutral. Their desire for carbon neutrality in turn makes electricity yet more expensive and restricts the overall quantity much more severely. As a result, there will be a significant demand for efficient electronics in the coming years, particularly as regards electricity consumption.

This development is already evident today, especially in power electronics, where the use of new semiconductor materials such as GaN or SiC has made it possible to reduce power consumption. A key driver for the development and introduction of these new materials was the electric car market, as reduced losses in the electronics lead directly to increased vehicle range. In the future, these materials will also find their way into other areas; for instance, they are already beginning to establish themselves in voltage transformers in various industries. However, this shift requires more factories and more suppliers for production, and further work also needs to be carried out to develop appropriate circuit concepts for these technologies.

In addition to the use of new materials, other concepts to reduce energy consumption are needed. The data center sector will require increasingly better-adapted circuits – ones that have been developed for a specific task and, as a result, can perform this task much more efficiently than universal processors. This involves striking the optimum balance between universal architectures, such as microprocessors and graphics cards, and highly specialized architectures that are suitable for only one use case. Some products will also fall between these two extremes. The increased energy efficiency is then "purchased" through the effort and expense of developing these highly specialized, adapted architectures. It's important to note that the more specialized an architecture is, the smaller the market for it. That means the only way such architectures will be economically viable is if they can be developed efficiently. This calls for new approaches that derive these architectures directly from high-level hardware/software optimization, without the additional implementation steps that are still necessary today. In sum, the only way to make this approach possible is by using novel concepts and tools to generate circuits directly from a high-level description.

The post Efficient Electronics appeared first on Semiconductor Engineering.


How To Successfully Deploy GenAI On Edge Devices

By Gordon Cooper | May 16, 2024 at 09:06

Generative AI (GenAI) burst onto the scene and into the public’s imagination with the launch of ChatGPT in late 2022. Users were amazed at the natural language processing chatbot’s ability to turn a short text prompt into coherent humanlike text including essays, language translations, and code examples. Technology companies – impressed with ChatGPT’s abilities – have started looking for ways to improve their own products or customer experiences with this innovative technology. Since the ‘cost’ of adding GenAI includes a significant jump in computational complexity and power requirements versus previous AI models, can this class of AI algorithms be applied to practical edge device applications where power, performance and cost are critical? It depends.

What is GenAI?

A simple definition of GenAI is 'a class of machine learning algorithms that can produce various types of content including humanlike text and images.' Early machine learning algorithms focused on detecting patterns in images, speech or text and then making predictions based on the data – for example, predicting the percentage likelihood that a certain image included a cat. GenAI algorithms take the next step: they perceive and learn patterns and then generate new patterns on demand by mimicking the original dataset. They generate a new image of a cat or describe a cat in detail.

While ChatGPT might be the most well-known GenAI algorithm, there are many available, with more being released on a regular basis. Two major types of GenAI algorithms are text-to-text generators – aka chatbots – like ChatGPT, GPT-4, and Llama2, and text-to-image generative models like DALLE-2, Stable Diffusion, and Midjourney. You can see example prompts and their returned outputs for these two types of GenAI models in figure 1. Because one is text based and one is image based, these two types of outputs will demand different resources from edge devices attempting to implement these algorithms.

Fig. 1: Example GenAI outputs from a text-to-image generator (DALLE-2) and a text-to-text generator (ChatGPT).

Edge device applications for Gen AI

Common GenAI use cases require connection to the internet and from there access to large server farms to compute the complex generative AI algorithms. However, for edge device applications, the entire dataset and neural processing engine must reside on the individual edge device. If the generative AI models can be run at the edge, there are potential use cases and benefits for applications in automobiles, cameras, smartphones, smart watches, virtual and augmented reality, IoT, and more.

Deploying GenAI on edge devices has significant advantages in scenarios where low latency, privacy or security concerns, or limited network connectivity are critical considerations.

Consider the possible application of GenAI in automotive applications. A vehicle is not always in range of a wireless signal, so GenAI needs to run with resources available on the edge. GenAI could be used for improving roadside assistance and converting a manual into an AI-enhanced interactive guide. In-car uses could include a GenAI-powered virtual voice assistant, improving the ability to set navigation, play music or send messages with your voice while driving. GenAI could also be used to personalize your in-cabin experience.

Other edge applications could benefit from generative AI. Augmented Reality (AR) edge devices could be enhanced by locally generating overlay computer-generated imagery and relying less heavily on cloud processing. While connected mobile devices can use generative AI for translation services, disconnected devices should be able to offer at least a portion of the same capabilities. Like our automotive example, voice assistant and interactive question-and-answer systems could benefit a range of edge devices.

While use cases for GenAI at the edge exist now, implementations must overcome the challenges of computational complexity and model size, as well as the limitations of power, area, and performance inherent in edge devices.

What technology is required to enable GenAI?

To understand GenAI’s architectural requirements, it is helpful to understand its building blocks. At the heart of GenAI’s rapid development are transformers, a relatively new type of neural network introduced in a Google Brain paper in 2017. Transformers have outperformed established AI models like Recurrent Neural Networks (RNNs) for natural language processing and Convolutional Neural Networks (CNNs) for images, video or other two- or three-dimensional data. A significant architectural improvement of a transformer model is its attention mechanism. Transformers can pay more attention to specific words or pixels than legacy AI models, drawing better inferences from the data. This allows transformers to better learn contextual relationships between words in a text string compared to RNNs and to better learn and express complex relationships in images compared to CNNs.
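
At its core, the attention mechanism described above is a weighted average of value vectors, where the weights come from query-key similarity: softmax(QK^T / sqrt(d)) V. A minimal single-head numpy sketch with random toy data (no masking, no multi-head logic), just to show the computation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted average of values

# Toy example: 4 tokens, embedding dimension 8, random data for illustration.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```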

Fig. 2: Parameter sizes for various machine learning algorithms.

GenAI models are pre-trained on vast amounts of data, which allows them to better recognize and interpret human language or other types of complex data. The larger the datasets, the better the model can process human language, for instance. Compared to CNN or vision transformer machine learning models, GenAI algorithms have parameters – the pretrained weights or coefficients used in the neural network to identify patterns and create new ones – that are orders of magnitude larger. We can see in figure 2 that ResNet50 – a common CNN algorithm used for benchmarking – has 25 million parameters (or coefficients). Some transformers, like BERT and Vision Transformer (ViT), have parameters in the hundreds of millions. Other transformers, like MobileViT, have been optimized to better fit embedded and mobile applications; MobileViT is comparable to the CNN model MobileNet in parameter count.

Compared to CNNs and vision transformers, ChatGPT requires 175 billion parameters and GPT-4 requires 1.75 trillion parameters. Even GPUs implemented in server farms struggle to execute these high-end large language models. How could an embedded neural processing unit (NPU) hope to handle so many parameters given the limited memory resources of edge devices? The answer is it cannot. However, there is a trend toward making GenAI more accessible in edge device applications, which have more limited computation resources. Some LLMs are tuned to reduce resource requirements by shrinking the parameter set. For example, Llama-2 is offered in a 70-billion-parameter version, but smaller versions with fewer parameters have also been created. Llama-2 with seven billion parameters is still large, but it is within reach of a practical embedded NPU implementation.

There is no hard threshold for generative AI running on the edge; however, text-to-image generators like Stable Diffusion, with one billion parameters, can run comfortably on an NPU, and the expectation is for edge devices to run LLMs of up to six to seven billion parameters. MLCommons has added GPT-J, a six-billion-parameter GenAI model, to its MLPerf edge AI benchmark list.

Running GenAI on the edge

GenAI algorithms require a significant amount of data movement and computational complexity (with transformer support). The balance of those two requirements determines whether a given architecture is compute bound – not enough multipliers for the data available – or memory bound – not enough memory and/or bandwidth for all the multiplications required for processing. Text-to-image has a better mix of compute and bandwidth requirements – more computation is needed to process two-dimensional images, and there are fewer parameters (in the one billion range). Large language models are more lopsided: less compute is required, but there is a significantly larger amount of data movement. Even the smaller (6-7B parameter) LLMs are memory bound.
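
One way to see why LLM token generation ends up memory bound is to compare the model's arithmetic intensity during decode (roughly two operations per weight, so about 2 ops per byte with INT8 weights) against the hardware's compute-to-bandwidth ratio; when the hardware can issue far more operations per byte of DRAM bandwidth than the model needs, the datapath sits idle waiting for weights. A hedged sketch with hypothetical accelerator numbers:

```python
# Is LLM single-token decode compute bound or memory bound on a given device?
# Model side: ~2 operations per weight per generated token.
# The hardware numbers used below are hypothetical, for illustration only.
def model_ops_per_byte(bytes_per_weight: float) -> float:
    """Arithmetic intensity of single-token decode (ops per byte moved)."""
    return 2.0 / bytes_per_weight

def bound(hw_tops: float, hw_gb_s: float, bytes_per_weight: float) -> str:
    hw_ops_per_byte = (hw_tops * 1e12) / (hw_gb_s * 1e9)
    if model_ops_per_byte(bytes_per_weight) < hw_ops_per_byte:
        return "memory bound"
    return "compute bound"

# Hypothetical edge NPU: 10 TOPS of compute, 51 GB/s of LPDDR5, INT8 weights.
print(bound(hw_tops=10, hw_gb_s=51, bytes_per_weight=1.0))  # memory bound
```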

The obvious solution is to choose the fastest memory interface available. From figure 3, you can see that a typical memory used in edge devices, LPDDR5, has a bandwidth of 51 GB/s, while HBM2E can support up to 461 GB/s. This does not, however, take into consideration the power-down benefits of LPDDR memory over HBM. While HBM interfaces are often used in high-end server-type AI implementations, LPDDR is almost exclusively used in power-sensitive applications because of its power-down capabilities.

Fig. 3: The bandwidth and power difference between LPDDR and HBM.

Using an LPDDR memory interface means the maximum data bandwidth will fall well short of what is achievable with an HBM memory interface. That means edge applications will inherently have less bandwidth for GenAI algorithms than an NPU or GPU used in a server application. One way to address bandwidth limitations is to increase the amount of on-chip L2 memory. However, this impacts area and, therefore, silicon cost. While embedded NPUs often implement hardware and software techniques to reduce bandwidth demand, these will not allow LPDDR to approach HBM bandwidths. The embedded AI engine will be limited by the amount of LPDDR bandwidth available.

Implementation of GenAI on an NPX6 NPU IP

The Synopsys ARC NPX6 NPU IP family is based on a sixth-generation neural network architecture designed to support a range of machine learning models including CNNs and transformers. The NPX6 family is scalable with a configurable number of cores, each with its own independent matrix multiplication engine, generic tensor accelerator (GTA), and dedicated direct memory access (DMA) units for streamlined data processing. The NPX6 can scale from applications requiring less than one TOPS of performance to those requiring thousands of TOPS, using the same development tools to maximize software reuse.

The matrix multiplication engine, GTA and DMA have all been optimized for supporting transformers, which allow the ARC NPX6 to support GenAI algorithms. Each core’s GTA is expressly designed and optimized to efficiently perform nonlinear functions, such as ReLU, GELU, sigmoid. These are implemented using a flexible lookup table approach to anticipate future nonlinear functions. The GTA also supports other critical operations, including SoftMax and L2 normalization needed in transformers. Complementing this, the matrix multiplication engine within each core can perform 4,096 multiplications per cycle. Because GenAI is based on transformers, there are no computation limitations for running GenAI on the NPX6 processor.

Efficient NPU design for transformer-based models like GenAI requires complex multi-level memory management. The ARC NPX6 processor has a flexible memory hierarchy and can support a scalable L2 memory up to 64MB of on chip SRAM. Furthermore, each NPX6 core is equipped with independent DMAs dedicated to the tasks of fetching feature maps and coefficients and writing new feature maps. This segregation of tasks allows for an efficient, pipelined data flow that minimizes bottlenecks and maximizes the processing throughput. The family also has a range of bandwidth reduction techniques in hardware and software to maximize bandwidth.

In an embedded GenAI application, the ARC NPX6 family will only be limited by the LPDDR available in the system. The NPX6 successfully runs Stable Diffusion (text-to-image) and Llama-2 7B (text-to-text) GenAI algorithms with efficiency dependent on system bandwidth and the use of on-chip SRAM. While larger GenAI models could run on the NPX6, they will be slower – measured in tokens per second – than server implementations. Learn more at www.synopsys.com/npx

The post How To Successfully Deploy GenAI On Edge Devices appeared first on Semiconductor Engineering.


DDR5 PMICs Enable Smarter, Power-Efficient Memory Modules

By Tim Messegee | May 16, 2024 at 09:05

Power management has received increasing focus in microelectronic systems as the need for greater power density, efficiency and precision has grown apace. One of the important ongoing trends in service of these needs has been the move toward localizing power delivery. To optimize system power, it's best to deliver as high a voltage as possible to the endpoint where the power is consumed. Then at the endpoint, that incoming high voltage can be regulated into the lower voltages, with higher currents, required by the endpoint components.

We saw this same trend play out in the architecting of the DDR5 generation of computer main memory. In planning for DDR5, the industry laid out ambitious goals for memory bandwidth and capacity. Concurrently, the aim was to maintain power within the same envelope as DDR4 on a per module basis. In order to achieve these goals, DDR5 required a smarter DIMM architecture; one that would embed more intelligence in the DIMM and increase its power efficiency. One of the biggest architectural changes of this smarter DIMM architecture was moving power management from the motherboard to an on-module Power Management IC (PMIC) on each DDR5 RDIMM.

In previous DDR generations, the power regulator on the motherboard had to deliver a low voltage at high current across the motherboard, through a connector and then onto the DIMM. As supply voltages were reduced over time (to maintain power levels at higher data rates), it was a growing challenge to maintain the desired voltage level because of IR drop. By implementing a PMIC on the DDR5 RDIMM, the problem with IR drop was essentially eliminated.
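
The benefit of moving regulation onto the DIMM falls out of Ohm's law: for a fixed power draw, the current flowing across the motherboard and connector (and therefore the I*R drop along that path) scales inversely with the delivery voltage. A small sketch with hypothetical numbers; the path resistance and power level are illustrative, not DDR5 specifications:

```python
# IR drop for the same delivered power at two supply voltages.
# All numbers are hypothetical, chosen only to illustrate the scaling.
def ir_drop_mv(power_w: float, supply_v: float, path_resistance_mohm: float) -> float:
    current_a = power_w / supply_v           # I = P / V
    return current_a * path_resistance_mohm  # V_drop = I * R, in mV

PATH_MOHM = 5.0   # hypothetical board + connector resistance (milliohms)
POWER_W   = 10.0  # hypothetical module power draw

for v in (1.1, 12.0):  # low-voltage rail vs. a higher bulk rail regulated on-DIMM
    print(f"{v:>4} V delivery: {ir_drop_mv(POWER_W, v, PATH_MOHM):.1f} mV drop")
# 1.1 V delivery: ~45.5 mV drop; 12.0 V delivery: ~4.2 mV drop
```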

In addition, the on-DIMM PMIC allows for very fine-grain control of the voltage levels supplied to the various components on the DIMM. As such, DIMM suppliers can really dial in the best power levels for the performance target of a particular DIMM configuration. On-DIMM PMICs also offered an economic benefit. Power management on the motherboard meant the regulator had to be designed to support a system with fully populated DIMMs. On-DIMM PMICs means only paying for the power management capacity you need to support your specific system memory configuration.

The upshot is that power management has become a major enabler of increasing memory performance. Advancing memory performance has been the mission of Rambus for nearly 35 years. We’re intimate with memory subsystem design on modules, with expertise across many critical enabling technologies, and have demonstrated the disciplines required to successfully develop chips for the challenging module environment with its increased power density, space constraints and complex thermal management challenges.

As part of the development of our DDR5 memory interface chipset, Rambus built a world-class power management team and has now introduced a new family of DDR5 server PMICs. This new server PMIC product family lays the foundation for a roadmap of future power management chips. As AI continues to expand from training to inference, increasing demands on memory performance will extend beyond servers to client systems and drive the need for new PMIC solutions tailored for emerging use cases and form factors across the computing landscape.


The post DDR5 PMICs Enable Smarter, Power-Efficient Memory Modules appeared first on Semiconductor Engineering.


How Quickly Can You Take Your Idea To Chip Design?

By Kira Jones | May 16, 2024 at 09:04

Gone are the days of expensive tapeouts done only by commercial companies. Thanks to Tiny Tapeout, students, hobbyists, and others can create a simple ASIC or PCB design and actually send it to a foundry for a small fraction of the usual cost. Learners from all walks of life can use the resources to learn how to design a chip, without signing an NDA or installing licenses, faster than ever before. Whether you're a digital, analog, or mixed-signal designer, there's support for you.

We’re excited to support our academic network in participating in this initiative to gain more hands-on experience that will prepare them for a career in the semiconductor industry. We have professors incorporating it into the classroom, giving students the exciting opportunity to take their ideas from concept to reality.

“It gives people this joy when we design the chip in class. The 50 students that took the class last year, they designed a chip and Google funded it, and every time they got their design on the chip, their eyes got really big. I love being able to help students do that, and I want to do that all over the country,” said Matt Morrison, associate teaching professor in computer science and engineering, University of Notre Dame.

We also advise and encourage extracurricular design teams to challenge themselves outside the classroom. We partner with multiple design teams focused on creating an environment for students to explore the design flow process from RTL-to-GDS, and Tiny Tapeout provides an avenue for them.

“Just last year, SiliconJackets was founded by Zachary Ellis and me as a Georgia Tech club that takes ideas to SoC tapeout. Today, I am excited to share that we submitted the club’s first-ever design to Tiny Tapeout 6. This would not have been possible without the help from our advisors, and industry partners at Apple and Cadence,” said Nealson Li, SiliconJackets vice president and co-founder.

Tiny Tapeout also creates a culture of knowledge sharing, allowing participants to share their designs online, collaborate with one another, and build off an existing design. This creates a unique opportunity to learn from others’ experiences, enabling faster learning and more exposure.

“One of my favorite things about this project is that you’re not only going to get your design, but everybody else’s as well. You’ll be able to look through the chips’ data sheet and try out someone else’s design. In our previous runs, we’ve seen some really interesting designs, including RISC-V CPUs, FPGAs, ring oscillators, synthesizers, USB devices, and loads more,” said Matt Venn, science & technology communicator and electronic engineer.

Tiny Tapeout is on its seventh run, launched on April 22, 2024, and will remain open until June 1, 2024, or until all the slots fill up! With each run, more unique designs are created, more knowledge is shared, and more of the future workforce is developed. Check out the designs that were just submitted for Tiny Tapeout 6.

What can you expect when you participate?

  • Access to training materials
  • Ability to create your own design using one of the templates
  • Support from the FAQs or Tiny Tapeout community

Interested in learning more? Check out their webpage. Want to use Cadence tools for your design? Check out our University Program and what tools students can access for free. We can’t wait to see what you come up with and how it’ll help you launch a career in the electronics industry!

The post How Quickly Can You Take Your Idea To Chip Design? appeared first on Semiconductor Engineering.
