Semiconductor Engineering

High-Level Synthesis Propels Next-Gen AI Accelerators

By: Russell Klein

May 20, 2024 at 09:01

Everything around you is getting smarter. Artificial intelligence is not just a data center application but will be deployed in all kinds of embedded systems that we interact with daily. We expect to talk to and gesture at them. We expect them to recognize and understand us. And we expect them to operate with just a little bit of common sense. This intelligence is making these systems not just more functional and easier to use, but safer and more secure as well.

All this intelligence comes from advances in deep neural networks. One of the key challenges of neural networks is their computational complexity. Small neural networks can take millions of multiply-accumulate operations (MACs) to produce a result. Larger ones can take billions. Large language models, and similarly complex networks, can take trillions. This level of computation is beyond what can be delivered by embedded processors.
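
To make the scale concrete, the MAC count of a single layer can be estimated from its shape. The sketch below is only an illustration: the layer dimensions are hypothetical and are not taken from any network discussed here, and bias additions and activation functions are ignored.

```python
def conv2d_macs(out_h, out_w, out_ch, in_ch, k_h, k_w):
    """MACs for one standard 2-D convolution layer (no grouping, bias ignored).

    Every output pixel of every output channel needs one MAC per
    input channel per kernel element.
    """
    return out_h * out_w * out_ch * in_ch * k_h * k_w


def dense_macs(in_features, out_features):
    """MACs for one fully connected layer: one MAC per weight."""
    return in_features * out_features


# Hypothetical layer: 3x3 kernel, 64 input and 64 output channels,
# producing a 56x56 feature map.
layer_macs = conv2d_macs(56, 56, 64, 64, 3, 3)
print(f"{layer_macs:,} MACs for one convolution layer")
```

Under these assumed dimensions a single convolution layer already needs more than 100 million MACs, so a network with a few dozen such layers lands in the billions.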

In some cases, the computation of these inferences can be off-loaded over a network to a data center. Increasingly, devices have fast and reliable network connections – making this a viable option for many systems. However, there are also a lot of systems that have hard real time requirements that cannot be met by even the fastest and most reliable networks. For example, any system that has autonomous mobility – self-driving cars or self-piloted drones – needs to make decisions faster than could be done through an off-site data center. There are also systems where sensitive data is being processed that should not be sent over networks. And anything that goes over a network introduces an additional attack surface for hackers. For all of these reasons – performance, privacy, and security – some inferencing will need to be done on embedded systems.

For very simple networks, embedded CPUs can handle the task. Even a Raspberry Pi can deploy a simple object recognition algorithm. For more complex tasks there are embedded GPUs, as well as neural processing units (NPUs) targeted at embedded systems that can deliver greater computational capability. But for the highest levels of performance and efficiency, building a bespoke AI accelerator can enable applications that would otherwise be impractical.

Engineering a new piece of hardware is a daunting undertaking, whether for ASIC or FPGA, but it enables developers to reach a level of performance and efficiency not possible with off-the-shelf components. How, though, can the average development team build a better machine learning accelerator than the designers creating the most leading-edge commercial AI accelerators, with multiple generations under their belt? By highly customizing the implementation to the specific inference being performed, the implementation can be an order of magnitude better than more generalized solutions.

When a general-purpose AI accelerator developer creates an NPU, their goal is to support any neural network that anyone might conceive. They want to get thousands of design-ins, so they have to make the design as general as possible. Not only that, but they also aim to have some level of “future proofing” built into their designs. They want to be able to support any network that might be imagined for several years into the future. That is not an easy task in a technology that is evolving so rapidly.

A bespoke accelerator needs to only support the one, or perhaps several, networks to be used. This freedom allows many programmable elements in the implementation of the accelerator to be fixed in hardware. This creates hardware that is both smaller and faster than something general purpose. For example, a dedicated convolution accelerator, with a fixed image and filter size, can be up to 10 times faster than a well-designed general purpose TPU.

General purpose accelerators usually use floating point numbers. This is because virtually all neural networks are developed in Python on general purpose computers using floating point numbers. To ensure correct support of those neural networks, the accelerator must, of course, support floating point numbers. However, most neural networks use numbers close to 0 and require a lot of precision there, so the wide dynamic range that floating point provides goes largely unused. Floating-point multipliers are also huge. If they are not needed, omitting them from the design saves a lot of area and power.

Some NPUs support integer representation, sometimes in a variety of sizes. But supporting multiple numeric representation formats adds circuitry, which consumes power and adds propagation delays. Choosing one representation and using that exclusively enables a smaller, faster implementation.

When building a bespoke accelerator, one is not limited to 8 bits or 16 bits; any size can be used. Picking the correct numeric representation, or “quantizing” a neural network, allows the data and the operators to be optimally sized. Quantization can significantly reduce the data needed to be stored, moved, and operated on. Reducing the memory footprint for the weight database and shrinking the multipliers can really improve the area and power of a design. For example, a 10-bit fixed-point multiplier is about 20 times smaller than a 32-bit floating-point multiplier and, correspondingly, will use about 1/20th of the power. This means the design can either be much smaller and more energy efficient by using the smaller multiplier, or the designer can opt to use the same area and deploy 20 multipliers that operate in parallel, producing much higher performance using the same resources.
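
The trade-off can be put into rough numbers. In the sketch below, the 20x area ratio is the figure quoted above; the weight count and the multiplier budget are hypothetical values chosen only for illustration.

```python
AREA_RATIO_FLOAT32_TO_FIXED10 = 20  # ratio quoted in the text above


def weight_storage_bytes(num_weights, bits_per_weight):
    """Bytes needed to store the weights, rounded up to a whole byte."""
    return (num_weights * bits_per_weight + 7) // 8


def fixed_multipliers_in_same_area(float_multiplier_count):
    """How many 10-bit fixed-point multipliers fit in the area of the
    given number of 32-bit floating-point multipliers."""
    return float_multiplier_count * AREA_RATIO_FLOAT32_TO_FIXED10


num_weights = 1_000_000  # hypothetical network size
float_bytes = weight_storage_bytes(num_weights, 32)
fixed_bytes = weight_storage_bytes(num_weights, 10)
print(f"float32 weights: {float_bytes:,} bytes")
print(f"10-bit weights:  {fixed_bytes:,} bytes")
print(f"multipliers in the area of 16 float32 units: "
      f"{fixed_multipliers_in_same_area(16)}")
```

Whether the saved area is spent on a smaller die or on more parallel multipliers is a design decision; the arithmetic only shows how much room quantization opens up.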

One of the key challenges in building a bespoke machine learning accelerator is that the data scientists who created the neural network usually do not understand hardware design, and the hardware designers do not understand data science. In a traditional design flow, they would use “meetings” and “specifications” to transfer knowledge and share ideas. But, honestly, no one likes meetings or specifications. And they are not particularly good at effecting an information exchange.

High-Level Synthesis (HLS) allows an implementation produced by the data scientists to be used, not just as an executable reference, but as a machine-readable input to the hardware design process. This eliminates the manual reinterpretation of the algorithm in the design flow, which is slow and extremely error prone. HLS synthesizes an RTL implementation from an algorithmic description. Usually, the algorithm is described in C++ or SystemC, but a number of design flows like HLS4ML are enabling HLS tools to take neural network descriptions directly from machine learning frameworks.

HLS enables a practical exploration of quantization in a way that is not yet practical in machine learning frameworks. To fully understand the impact of quantization requires a bit-accurate implementation of the algorithm, including the characterization of the effects of overflow, saturation, and rounding. Today this is only practical in hardware description languages (HDLs) or HLS bit-accurate data types (https://hlslibs.org).
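
The effects named above can be illustrated with a small software model of a signed fixed-point value. This is a minimal sketch in plain Python, not the API of any HLS data type library; the function name `quantize` and its parameters are hypothetical.

```python
import math


def quantize(value, total_bits, frac_bits, saturate=True, round_nearest=True):
    """Model a signed fixed-point number with `total_bits` bits, of which
    `frac_bits` are fractional, and return the real value it represents.

    round_nearest=True rounds half up; False truncates toward minus infinity.
    saturate=True clamps on overflow; False wraps like two's complement.
    """
    scale = 1 << frac_bits
    scaled = value * scale
    code = math.floor(scaled + 0.5) if round_nearest else math.floor(scaled)

    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    if saturate:
        code = max(lowest, min(highest, code))
    else:
        # Keep only the low `total_bits` bits, then reinterpret as signed.
        code = ((code - lowest) % (1 << total_bits)) + lowest
    return code / scale


# 10 bits total, 8 fractional: representable range is -2.0 to just under 2.0.
print(quantize(0.3, 10, 8))                       # rounded
print(quantize(0.3, 10, 8, round_nearest=False))  # truncated
print(quantize(3.0, 10, 8))                       # overflow, saturated
print(quantize(3.0, 10, 8, saturate=False))       # overflow, wrapped
```

Running a network's weights and activations through a model like this shows how much accuracy each choice of width, rounding, and overflow mode costs before any hardware is built.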

As machine learning becomes ubiquitous, more embedded systems will need to deploy inferencing accelerators. HLS is a practical and proven way to create bespoke accelerators, optimized for a very specific application, that deliver higher performance and efficiency than general purpose NPUs.

For more information on this topic, read the paper: High-Level Synthesis Enables the Next Generation of Edge AI Accelerators.

Chip Aging Becoming Key Factor In Data Center Economics

Semiconductor Engineering

By: Ann Mutschler

May 20, 2024 at 09:01

Chip aging is becoming a much bigger concern inside of data centers, where it can impact server uptime, utilization rates, and the amount of energy needed to drive signals and cool entire server racks.

Aging in chips is the result of both higher logic utilization and increasing transistor density. This is problematic for data centers, in general, but especially for AI chips where digital logic is expected to run at maximum speed. That generates more heat, which becomes harder to dissipate as the number of specialized and general-purpose processing elements per square millimeter of silicon continues to rise. Heat typically gets trapped between the fins of finFETs and gate-all-around FETs, accelerating electromigration and reducing the time it takes for dielectrics to break down. It also can cause warpage, which can rupture the bonds and contacts between different components in an advanced package or on a PCB.

For data centers, that creates a number of challenges:

Thermal management: This requires a deep understanding of workloads and the resulting transient thermal gradients as processing is load-balanced on-chip, between chips or chiplets, and between servers;
More data: Data from sensors everywhere, along with larger training sets, all need to be processed faster than in the past to keep up with the flood of data, but all of that needs to happen in the same or smaller footprint without overheating any part of a device, and
In-circuit monitoring: Sensors can be added into chips to detect variations in heat and data speeds in different paths, but it’s much more difficult to keep track of tens of thousands of these monitors as they collect data from heterogeneous processing elements, each of which can age at different rates depending on process variation, defectivity, varying workloads, and ambient thermal conditions.

“Servers are much more capable today than they were 10 years ago, and the issue is that power hasn’t scaled like it used to,” said Steven Woo, Rambus fellow and distinguished inventor. “Now, if you want to do lots more work in your server, you have to burn more power to do it. Twenty years ago, a server might dissipate a couple hundred watts. But with the latest servers that NVIDIA just announced around Grace Blackwell, the whole rack is 120 kilowatts, and the individual servers are many kilowatts. Just delivering power into those racks is causing changes in the infrastructure in the industry. Now that you have to bring in and dissipate more power in a small space, you get all kinds of interesting things that could happen over time. The heat that’s being dissipated can have effects on the chip, and you have to worry sometimes about thermal cycling where, as the chip is doing a lot of work, maybe part of the chip stops and then it does more work. You get these rapid cycles of dissipating a lot of power, then not, then dissipating a lot of power, then not. That cycling causes local heating and cooling, leading to thermal stresses, and this impacts all chips, including memory.”

As a result, everyone from the data center manager to the chip architect now has to understand how a chip behaves in the field, and how increasingly customized chip and system architectures will function over time. Downtime is costly for a data center, but under-utilization and reduced performance also carry a high price tag. That, in turn, affects how much margin is considered essential, such as extra data paths if some of them are fully or partially closed off by electromigration, and how that margin will impact performance, power, and area/cost over a chip’s projected lifetime, especially in a heterogeneous design with specialized compute elements.

“When it comes to the hyper-scalers and high powered, highly customized, heterogeneous chips for various different workloads, these chips are on 24/7, so consistent uptime is critical,” said Dan Lee, product management director at Cadence. “Since all of these chips are done at the really advanced nodes, with the smaller device sizes, more developers are looking to do aging analysis, and derive the wear and tear so they can see if the chip is going to last a year or five years. At the same time, an important consideration is also thermal — especially when we’re talking about these heterogeneous integrations, and you don’t really get the thermal conductivity that you would in a straightforward, monolithic design. There’s a bit more thought or planning that needs to be a part of this because aging and heating are related. All things being equal, if you’re operating in a very hot environment, you’re going to expect a lower lifespan.”

Still, determining how much shorter that lifespan will be isn’t always a precise calculation. “Data center SoCs that execute mission-critical workloads need to provide scalable visibility, predict problems before they occur, provide deep-dive analyses into problems, and be optimized to increase longevity of investment,” said Padmakumar Karthik, senior technology manager at Arm. “Data center diagnostic patterns are often deployed to measure the health of an SoC post-manufacturing to prevent silent data corruption (SDC) issues. But on-chip sensors provide an additional layer of insights, detecting droops or aging or thermal events on-chip, all of which can cause SDC incidents. For this reason, scalable, customizable sensor frameworks that can monitor and adapt throughout the useful life of the device, enabling continuous design optimization and preventive maintenance, will be increasingly important.”

There are multiple ways to achieve this, but each data center can be very different. In some cases, chips are designed by systems companies for internal use. And in most cases, there is a mix of different hardware and software, not all of which is state-of-the-art. “Many data centers have legacy infrastructure that may not be inherently designed for optimal power efficiency,” noted Noam Brousard, vice president of systems at proteanTecs, in a recent blog. “Upgrading or retrofitting such infrastructure poses challenges in achieving comprehensive power optimization.”

Even within a single rack, stresses can vary greatly from one server to the next, and from one chip to the next even in the same server. “You can imagine when you have a very big chip, toward the edges of the chip it will expand more than in a small chip, and that can add stress,” said Rambus’ Woo. “You have to really be careful about how you cool things, and memory is no different. You have very specific things you worry about with memory, like the ability to retain data, depending on how hot the chip is.”

In addition, as chips age, parameters drift. Marc Swinnen, director of product marketing in Ansys’ semiconductor division, said the traditional approach has been to use a library that’s characterized as a brand new chip. “The library is characterized at 1 year, 5 years, 10 years, 15 years, and you can run all your analysis multiple times with these different aged libraries. That sounds good on paper, and that’s what a lot of people do, but the problem is that not all parts of the chip age at the same rate. This is why aging is often associated with activity and temperature. Some parts of the chip are more active and hotter than other parts of the chip, so the aging time runs differently for different parts. This means you want to apply some of the old library to some parts of the chip, and the younger library to other parts of the chip, because if signals run between them you have setup and hold issues. If everything slows down at the same time — or one slows down and the other one doesn’t — you’re going to get mismatches, and that’s the difficulty. At the bottom level, it’s easy. Every gate is assigned its right age. That’s simple. You do an analysis with every gate. But how do you assign the age to every gate? Where do you get that information from? You need a lot of realistic activity, and then predict that over the lifespan and with temperature. That’s the problem. How do you actually construct this aging map? Once you have it, the analysis is not that hard.”
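
One way to picture the aging map Swinnen describes is as a table that assigns each block the aged library closest to its effective age. The sketch below rests on loud assumptions: the block names, activity factors, and thermal factors are hypothetical, and scaling service time by activity and a thermal factor is a deliberate simplification, not a foundry aging model. Only the 1-, 5-, 10-, and 15-year library points come from the quote.

```python
AGED_LIBRARIES_YEARS = [1, 5, 10, 15]  # characterization points from the quote


def effective_age_years(service_years, activity, thermal_factor):
    """Simplified effective age: calendar time scaled by how busy and how
    hot the block is. Both scale factors are illustrative assumptions."""
    return service_years * activity * thermal_factor


def pick_library(effective_age):
    """Choose the characterized library whose age is closest."""
    return min(AGED_LIBRARIES_YEARS, key=lambda years: abs(years - effective_age))


# Hypothetical blocks: (activity factor, thermal factor)
blocks = {
    "alu": (0.9, 1.6),         # busy and hot
    "cache_ctrl": (0.5, 1.0),  # moderately busy
    "debug": (0.05, 0.8),      # almost idle and cool
}

aging_map = {
    name: pick_library(effective_age_years(10, activity, thermal))
    for name, (activity, thermal) in blocks.items()
}
print(aging_map)
```

After the same 10 years in service, the three blocks end up on three different libraries, which is exactly the situation that produces setup and hold mismatches on signals crossing between them. A sign-off flow might round up to the older library rather than pick the nearest one; nearest is used here only to keep the sketch short.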

Aging maps are application- and workload-specific. Every chip will age differently depending on the functions it performs.

But aging is just one of many factors that affect data center uptime. “When we look at data center, we look at the whole application first, then whittle it down to what that means for chips and packages,” said Kelly Morgan, senior principal application engineer at Ansys. “From the mechanical reliability lens of the data center operation, we go through thermal cycling, obviously. We’re in a controlled environment. But what does that influence? How does that influence the integrity of the chips as you go through thermal cycles? Typically, we’ll look at things like solder fatigue and other effects.”

Another factor to consider is shipping and handling, which can affect the aging of a chip, package, and board.

“Even before the device is put in place, there are opportunities for vibration,” Morgan said. “You might hit something, which is a bit of a shock. We have customers who are looking at things like drop, shock, and vibration, and they have goals they need to test to. Typically, the standard process is to do a lot of physical testing. Now as you can imagine, that can be pretty challenging. You have to be pretty far along in the design process before you really start to go and test, and if there’s an issue, then you’ve got to go back and retest. Early simulation helps here, especially for those larger-scale events, and that comes down to the chassis, the board, to all the components, including the ICs.”

Fig. 1: Components of complete electronic system analysis. Source: Ansys

Quality control remains a big challenge when it comes to mechanical stresses that can affect aging. Adam Cron, distinguished architect at Synopsys, pointed to a recent Intel white paper, which noted that at the current acceptable defectivity rates, one core fails every two days. To account for this, Cron noted that certain commercial tools support in-system delay testing in a BiST mode. By adding specific IP, any ATPG patterns could be added to that. (Intel’s paper said its solution only applies to stuck-at testing.)

“In very large, millions-of-cores data center-type environments, the implication is that you’d better be ready,” Cron said. “One of the things they were talking about in this paper was in-system scan. Intel was bringing a database of test patterns in, and then applying it in-system after isolating a core. And then, upon a failure, they’d quarantine and move on. But the data centers are apparently running out of that opportunistic time slot to do any of this. We’ve heard some interesting conversations about the fact that people do run a lot of things during certain times. However, other times are cheaper, so all the holes are just getting filled in terms of runtime. Monitors are certainly something to look at, but monitors are looking at systemic degradation. That’s known, if you will. And so as things degrade, V_min will change, maybe frequency will change. And they’ll be on a pace. They can figure out when to do that. That’s easy enough to figure out. However, if there’s a marginality or some broken component in there, it is not up to the tool to find that. And frankly, the in-system scan wasn’t addressing all components on the die. It was only up to like 80% of stuck-at coverage, which isn’t that much, especially when you’re not looking at all of the pieces inside the die. The point is, there are still opportunities to do better.”

Cron noted that one big systems company suggested a dual-core lockstep mechanism, starting out the data center in dual-core lock-step mode for X number of months. “When it looks like you’ve squeezed the major part of the curve out, in terms of finding these defective components, then unlock them, double your capacity, run like that for a while, and periodically hook some back up again. That means everything is utilized, at least. Of course, some are working at half capacity here and there, but it’s not the whole die. And there are some implications there from a design standpoint, at least for the hardware, but also possibly the operating system, depending on who decides what physical core is used versus what virtual core is used.”
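
The lockstep idea can be sketched in a few lines: two cores run the same work and any disagreement flags a suspect part. In this illustration the "cores" are plain Python functions, and the fault model (one output bit stuck at 1) is a hypothetical stand-in for a real defect.

```python
def healthy_core(x):
    """Reference computation."""
    return x * x + 1


def faulty_core(x):
    """Same computation, but output bit 2 is stuck at 1 (assumed fault)."""
    return (x * x + 1) | 0b100


def run_lockstep(core_a, core_b, workload):
    """Run both cores on every item and return the indices where they disagree."""
    mismatches = []
    for index, item in enumerate(workload):
        if core_a(item) != core_b(item):
            mismatches.append(index)
    return mismatches


print(run_lockstep(healthy_core, healthy_core, range(8)))
print(run_lockstep(healthy_core, faulty_core, range(8)))
```

The faulty pair disagrees on some inputs but not all of them, because the stuck bit is invisible whenever the correct result already has that bit set. That data dependence is one reason lockstep needs to run for a while before a core can be trusted.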

Approaches to measuring aging
Any discussion around aging circuits really boils down to extending the life of the machines in the data center, and not getting caught by surprise when failures occur.

“How do you do that? You have to measure the aging of those machines,” said Neil Hand, director of marketing, IC segment at Siemens EDA. “Right now, if you speak to the CIOs of these big companies with big data centers, they say, ‘We’ve got to get rid of the machines after three years because we can’t risk it going down.’ If you look at embedded analytics capabilities, you can start to embed aging monitors in those devices, you can start to monitor those in real time. It doesn’t look that different than what it does from an automotive perspective. It’s all the same technologies, effectively, but you’re monitoring them. And then you can say, ‘We’re now at 90% of our life for this server.’ We can then just replace that server.”
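
A replacement policy of the kind Hand describes might look like the following sketch. It assumes an on-chip monitor that reports an oscillator frequency, an end-of-life definition of a 10% slowdown, and linear wear between fresh and end of life. All three are illustrative assumptions; real aging curves are generally not linear.

```python
def life_consumed(fresh_freq_mhz, current_freq_mhz, eol_slowdown=0.10):
    """Fraction of useful life used, from 0.0 (fresh) to 1.0 (end of life).

    eol_slowdown is the assumed relative slowdown at which the part is
    considered worn out.
    """
    slowdown = (fresh_freq_mhz - current_freq_mhz) / fresh_freq_mhz
    return min(1.0, max(0.0, slowdown / eol_slowdown))


def should_replace(fraction_consumed, threshold=0.90):
    """Flag the server once it crosses the replacement threshold."""
    return fraction_consumed >= threshold


# Hypothetical monitor readings for three servers (fresh, current) in MHz.
readings = {"srv-a": (1000, 995), "srv-b": (1000, 950), "srv-c": (1000, 905)}
for name, (fresh, current) in readings.items():
    used = life_consumed(fresh, current)
    print(f"{name}: {used:.0%} of life used, replace={should_replace(used)}")
```

The point of the sketch is the policy shape: replacement is triggered by measured wear on each machine rather than by a fixed three-year calendar.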

This feeds into corporate goals around sustainability, as well. “It comes down to building the best thing to begin with, then building it with design for manufacturing in mind so that you don’t get waste during manufacturing, achieve better yields, and finally extend the life of products and build them in environmentally-sustainable ways,” Hand said. “If you can extend the data center lifecycle from three years to five years, that’s big. And especially if you start going to these high-performance, application-specific type of clusters, you may not need to change them as often, because if the underlying capabilities aren’t changing, that might drive the cycling of it. In the case of a biological computer, if there’s no new change to the underlying protein folding mechanisms, you might say, ‘We don’t need a new compute platform. This is really good.’”

The longer the product life can be extended, the better. Design for aging is a matter of, first, performing the aging analysis with the foundry models. “Run the simulations and observe the effects,” said Cadence’s Lee. “When you’re doing the simulation, you want to have the right mission profiles, so you come up with an accurate prediction of how your device is going to behave after a certain number of years in deployment. You may want to combine that with thermal analysis, for example, because how that aging is going to behave will depend on what temperature this design is going to be working at. You may think it’s 22 degrees Celsius, but maybe through some thermal analysis you realize it’s actually going to be operating at 35 or 40 degrees most of the time. That may change the outcome of your aging analysis.”
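
Why a shift from 22 to 40 degrees matters can be illustrated with the Arrhenius relation, a common first-order model for thermally accelerated wear-out. Using it here is an assumption, not something Lee prescribes, and the 0.7 eV activation energy is a placeholder; the real value depends on the failure mechanism and the foundry models.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_ref_c, t_actual_c, activation_energy_ev=0.7):
    """How much faster wear-out proceeds at t_actual_c than at t_ref_c.

    Temperatures are in degrees Celsius. The activation energy is an
    assumed placeholder value.
    """
    t_ref_k = t_ref_c + 273.15
    t_actual_k = t_actual_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_ref_k - 1.0 / t_actual_k
    )
    return math.exp(exponent)


for temperature_c in (22, 35, 40):
    factor = arrhenius_acceleration(22, temperature_c)
    print(f"{temperature_c} C: aging proceeds {factor:.2f}x as fast as at 22 C")
```

With this placeholder activation energy, a part assumed to sit at 22 degrees but actually running at 40 ages several times faster than planned, which is why the mission profile and the thermal analysis have to agree.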

In terms of the associated thermal analysis, this can extend beyond a single device. “It’s also how that heat is moving,” Lee said. “Let’s say you have this integrated design, where you have some power devices alongside some logic, or some other functionality that is lower power. What you may want to understand is, if those bandgaps or power circuits are generating a lot of heat, that may be shifting over into other parts of your design. So when you run your aging analysis, you may assume that you’re running at 25 degrees, whereas the power devices are at 40 or 45 degrees. They’re on the same chip, they’re very close to each other, and you have to understand how much of that heat is moving over to your logic and what that’s going to bring the temperature up to. You want to know that so you can perform the aging analysis based on that higher temperature.”

Another consideration is combining aging analysis and interconnect parasitics, which is especially relevant for advanced nodes due to the parasitics in the interconnect. “They’re dominant when it comes to performance and functionality,” Lee added. “So when thinking about aging, you also have to think about it being an aged device that has to push the electrons through this interconnect. That’s a pretty heavy load. When you’re doing the aging analysis, you probably will have to be doing it with extracted parasitics. You just can’t do it on a pure schematic design. It doesn’t give you enough detail about what’s really happening physically. This may be included in the aging analysis tool. When most people talk about aging, they may not think about the parasitic aspect to it.”

Combating aging, thermal in memory
While standards don’t work in custom silicon, they do work for some standard components in those devices, such as memory. Over the past 10 to 15 years, memory standards have started to address the impact of heat.

“If you start to exceed certain temperature limits, you’ve got to refresh the device more frequently because the charge can leak off the cells more quickly,” said Rambus’ Woo. “So there are temperature-dependent refresh rates. There are other things that can be exacerbated, like the capacitors are getting smaller, they’re holding fewer electrons because there are so many more of them on a chip now, so we’ve seen memories adopt on-die error correction. This on-die error correction is something that is hidden from the outside world. In many cases, you don’t even know an error has occurred and been corrected on the chip. Those kinds of technologies become even more important now because the temperatures can be higher.”
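
A temperature-dependent refresh rate can be sketched as a simple threshold rule. The numbers below (a 7.8 microsecond average refresh interval, halved above 85 degrees Celsius) are typical of DDR3/DDR4-class parts but are used here as assumptions; the datasheet of the actual device is the authority.

```python
def refresh_interval_us(temp_c, base_interval_us=7.8, hot_threshold_c=85.0):
    """Average interval between refresh commands, in microseconds.

    Above the threshold the interval is halved, so the cells are refreshed
    twice as often. Default values are assumed typical figures.
    """
    if temp_c > hot_threshold_c:
        return base_interval_us / 2
    return base_interval_us


def refresh_commands_per_second(temp_c):
    """Refresh commands the controller must issue each second."""
    return 1_000_000 / refresh_interval_us(temp_c)


for temp_c in (60, 85, 90):
    print(f"{temp_c} C: refresh every {refresh_interval_us(temp_c)} us, "
          f"{refresh_commands_per_second(temp_c):,.0f} commands/s")
```

Each refresh command makes the memory briefly unavailable for normal accesses, so running hot costs bandwidth as well as reliability margin.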

There also is growing demand for more telemetry to provide monitoring information. “You just want to know if anything is overheating,” said Woo. “Does something seem like it’s malfunctioning? The data center manager will get regular updates about the status of the major components of the system. A lot of boards now in servers have baseboard management controllers (BMCs), which are little chips that sit on each board and are responsible for, among other things, reporting back the health of that board when a server might have five or six boards. We’re frequently seeing more of these BMC chips.”

Design for aging
While the goal is to be able to guarantee a certain lifetime for the chips in a data center, the challenges for achieving that are expanding. “There’s a growing list of things that can be harmful to devices over their lifetime,” Woo said. “It’s a balance between not adding too much cost, even though you have to increase the reliability and maybe add new features, and all of these things are in play with each other.”

Whether it is liquid cooling or higher levels of RAS ECC in the system, there is no single best answer for every application. In general, the industry is moving toward higher reliability and increasing resilience, but there are many ways to get there and challenges with each of them.

“Just as 15 years ago we didn’t necessarily always think we had to talk about power, now we have to talk about it all the time,” Woo said. “The same thing is going to be true for resilience and reliability. It’s going to be required to become part of the way people think about architectures, and part of that is how the memory system improves its reliability. You can’t really do anything unless you can compute on some data, and you have to make sure that data is reliable. It will touch how memory is stored in a DRAM. It will touch how memory is communicated across links. And it even will touch how processors manipulate data once they get a hold of it in their caches, and in the compute pipelines. Also, one of the key things people will worry about is how much of that susceptibility is brought about by age-related issues, like heating cycles, etc.”

Finally, there are even issues around the quality of the power that comes into a system. “The servers get noise on the power rails, and it’s a balance between how much money you’re willing to pay for the power delivery versus the quality of power,” said Woo. “You have to be tolerant of those kinds of things, too. Power management becomes more challenging, as well as the amount of power that these systems are using today. NVIDIA systems bring 48-volt power into the racks, and there is talk about even higher voltage levels. Those changes in infrastructure can all impact heat, and can age components differently.”
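
The pull toward higher rack voltages follows from basic circuit arithmetic. The sketch below uses the 120 kW rack figure quoted earlier in the article and the 48-volt bus mentioned above; the 12-volt comparison point and the 1 milliohm path resistance are assumptions added for illustration.

```python
RACK_POWER_W = 120_000  # rack power quoted earlier in the article


def rack_current_amps(power_watts, bus_volts):
    """Current needed to deliver the power at a given bus voltage (I = P / V)."""
    return power_watts / bus_volts


def conduction_loss_watts(current_amps, path_resistance_ohms):
    """Resistive loss in the delivery path (P = I^2 * R)."""
    return current_amps ** 2 * path_resistance_ohms


PATH_RESISTANCE_OHMS = 0.001  # hypothetical delivery-path resistance

for bus_volts in (12, 48):
    current = rack_current_amps(RACK_POWER_W, bus_volts)
    loss = conduction_loss_watts(current, PATH_RESISTANCE_OHMS)
    print(f"{bus_volts} V bus: {current:,.0f} A, {loss:,.0f} W lost in the path")
```

Quadrupling the bus voltage cuts the current to a quarter and the resistive loss in the same path to a sixteenth, and less loss means less heat to remove from the rack.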

DRAM Microarchitectures And Their Impacts On Activate-Induced Bitflips Such As RowHammer

Semiconductor Engineering

By: Technical Paper Link

May 19, 2024 at 22:57

A technical paper titled “DRAMScope: Uncovering DRAM Microarchitecture and Characteristics by Issuing Memory Commands” was published by researchers at Seoul National University and University of Illinois at Urbana-Champaign.

Abstract:

“The demand for precise information on DRAM microarchitectures and error characteristics has surged, driven by the need to explore processing in memory, enhance reliability, and mitigate security vulnerability. Nonetheless, DRAM manufacturers have disclosed only a limited amount of information, making it difficult to find specific information on their DRAM microarchitectures. This paper addresses this gap by presenting more rigorous findings on the microarchitectures of commodity DRAM chips and their impacts on the characteristics of activate-induced bitflips (AIBs), such as RowHammer and RowPress. The previous studies have also attempted to understand the DRAM microarchitectures and associated behaviors, but we have found some of their results to be misled by inaccurate address mapping and internal data swizzling, or lack of a deeper understanding of the modern DRAM cell structure. For accurate and efficient reverse-engineering, we use three tools: AIBs, retention time test, and RowCopy, which can be cross-validated. With these three tools, we first take a macroscopic view of modern DRAM chips to uncover the size, structure, and operation of their subarrays, memory array tiles (MATs), and rows. Then, we analyze AIB characteristics based on the microscopic view of the DRAM microarchitecture, such as 6F² cell layout, through which we rectify misunderstandings regarding AIBs and discover a new data pattern that accelerates AIBs. Lastly, based on our findings at both macroscopic and microscopic levels, we identify previously unknown AIB vulnerabilities and propose a simple yet effective protection solution.”

Find the technical paper here. Published May 2024.

Nam, Hwayong, Seungmin Baek, Minbok Wi, Michael Jaemin Kim, Jaehyun Park, Chihun Song, Nam Sung Kim, and Jung Ho Ahn. “DRAMScope: Uncovering DRAM Microarchitecture and Characteristics by Issuing Memory Commands.” arXiv preprint arXiv:2405.02499 (2024).

Related Reading
Securing DRAM Against Evolving Rowhammer Threats
A multi-layered, system-level approach is crucial to DRAM protection.

The post DRAM Microarchitectures And Their Impacts On Activate-Induced Bitflips Such As RowHammer appeared first on Semiconductor Engineering.

Competitive Open-Source EDA Tools

Semiconductor Engineering

By: Technical Paper Link

May 19, 2024 at 22:52

A technical paper titled “Basilisk: Achieving Competitive Performance with Open EDA Tools on an Open-Source Linux-Capable RISC-V SoC” was published by researchers at ETH Zurich and University of Bologna.

Abstract:

“We introduce Basilisk, an optimized application-specific integrated circuit (ASIC) implementation and design flow building on the end-to-end open-source Iguana system-on-chip (SoC). We present enhancements to synthesis tools and logic optimization scripts improving quality of results (QoR), as well as an optimized physical design with an improved power grid and cell placement integration enabling a higher core utilization. The tapeout-ready version of Basilisk implemented in IHP’s open 130 nm technology achieves an operation frequency of 77 MHz (51 logic levels) under typical conditions, a 2.3x improvement compared to the baseline open-source EDA design flow presented in Iguana, and a higher 55% core utilization compared to 50% in the baseline design. Through collaboration with EDA tool developers and domain experts, Basilisk exemplifies a synergistic effort towards competitive open-source electronic design automation (EDA) tools for research and industry applications.”

Find the technical paper here. Published May 2024.

Sauter, Phillippe, Thomas Benz, Paul Scheffler, Zerun Jiang, Beat Muheim, Frank K. Gürkaynak, and Luca Benini. “Basilisk: Achieving Competitive Performance with Open EDA Tools on an Open-Source Linux-Capable RISC-V SoC.” arXiv preprint arXiv:2405.03523 (2024).

Related Reading
EDA Back On Investors’ Radar
Big changes are fueling growth, and it’s showing in EDA revenue, acquisitions, and stock prices.
RISC-V Wants All Your Cores
It is not enough to want to dominate the world of CPUs. RISC-V has every core in its sights, and it’s starting to take steps to get there.

The post Competitive Open-Source EDA Tools appeared first on Semiconductor Engineering.

Chip Industry Week In Review

Semiconductor Engineering

By: The SE Staff

May 17, 2024 at 09:01

President Biden will raise the tariff rate on Chinese semiconductors from 25% to 50% by 2025, among other measures to protect U.S. businesses from China’s trade practices. Also, as part of President Biden’s AI Executive Order, the Administration released steps to protect workers from AI risks, including human oversight of systems and transparency about what systems are being used.

Intel is in advanced talks with Apollo Global Management for the equity firm to provide more than $11 billion to build a fab in Ireland, reported the Wall Street Journal. Also, Intel’s Foundry Services appointed Kevin O’Buckley as the senior vice president and general manager.

Polar is slated to receive up to $120 million in CHIPS Act funding to establish an independent American foundry in Minnesota. The company expects to invest about $525 million in the expansion of the facility over the next two years, with a $75 million investment from the State of Minnesota.

Arm plans to develop AI chips for launch next year, reports Nikkei Asia.

South Korea is planning a support package worth more than 10 trillion won ($7.3 billion) aimed at chip materials, equipment makers, and fabless companies throughout the semiconductor supply chain, according to Reuters.

Quick links to more news:

Global
In-Depth
Markets and Money
Security
Supercomputing
Education and Training
Product News
Research
Events and Further Reading

Global

Edwards opened a new facility in Asan City, South Korea. The 15,000m² factory provides a key production site for abatement systems, and integrated vacuum and abatement systems for semiconductor manufacturing.

France’s courtship with mega-tech is paying off. Microsoft is investing more than US $4 billion to expand its cloud computing and AI infrastructure, including bringing up to 25,000 advanced GPUs to the country by the end of 2025. The “Choose France” campaign also snagged US $1.3 billion from Amazon for cloud infrastructure expansion, genAI and more.

Toyota, Nissan, and Honda are teaming up on AI and chips for next-gen cars with support from Japan’s Ministry of Economy, Trade and Industry (METI), reports Nikkei Asia.

Meanwhile, IBM and Honda are collaborating on long-term R&D of next-gen technologies for software-defined vehicles (SDV), including chiplets, brain-inspired computing, and hardware-software co-optimization.

Siemens and Foxconn plan to collaborate on global manufacturing processes in electronics, information and communications technology, and electric vehicles (EV).

TSMC confirmed a Q4 2024 construction start date for its first European plant in Dresden, Germany.

Amazon Web Services (AWS) plans to invest €7.8 billion (~$8.4B) in the AWS European Sovereign Cloud in Germany through 2040. The system is designed to serve public sector organizations and customers in highly regulated industries.

In-Depth

Semiconductor Engineering published its Low Power-High Performance newsletter this week, featuring these stories:

- Will Domain-Specific ICs Become Ubiquitous?
- Running More Efficient AI/ML Code With Neuromorphic Engines
- Power/Performance Costs In Chip Security

And this week’s Test, Measurement & Analytics newsletter featured these stories:

- Using Predictive Maintenance To Boost IC Manufacturing Efficiency
- The Future Of Fault Coverage In Chips
- Doing More At Functional Test

Markets and Money

The U.S. National Institute of Standards and Technology (NIST) awarded more than $1.2 million to 12 businesses in 8 states under the Small Business Innovation Research (SBIR) Program to fund R&D of products relating to cybersecurity, quantum computing, health care, semiconductor manufacturing, and other critical areas.

Engineering services and consulting company Infosys completed the acquisition of InSemi Technology, a provider of semiconductor design and embedded software development services.

The quantum market, which includes quantum networking and sensors alongside computing, is predicted to grow from $838 million in 2024 to $1.8 billion in 2029, reports Yole.

Shipments of OLED monitors reached about 200,000 units in Q1 2024, a year over year growth of 121%, reports TrendForce.

Global EV sales grew 18% in Q1 2024 with plug-in hybrid electric vehicles (PHEV) sales seeing 46% YoY growth and battery electric vehicle (BEV) sales growing just 7%, according to Counterpoint. China leads global EV sales with 28% YoY growth, while the US grew just 2%. Tesla saw a 9% YoY drop, but topped BEV sales with a 19% market share. BYD grew 13% YoY and exported about 100,000 EVs with 152% YoY growth, mainly in Southeast Asia.

DeepX raised $80.5 million in Series C funding for its on-device NPU IP and AI SoCs tailored for applications including physical security, robotics, and mobility.

MetisX raised $44 million in Series A funding for its memory solutions built on Compute Express Link (CXL) for accelerating large-scale data processing applications.

Security

While security experts have been warning of a growing threat in electronics for decades, there have been several recent fundamental changes that elevate the risk.

Synopsys and the Ponemon Institute released a report showing 54% of surveyed organizations suffered a software supply chain attack in the past year and 20% were not effective in their response. And 52% said their development teams use AI tools to generate code, but only 32% have processes to evaluate it for license, security, and quality risks.

Researchers at Ruhr University Bochum and TU Darmstadt presented a solution for the automated generation of fault-resistant circuits (AGEFA) and assessed the security of examples generated by AGEFA against side-channel analysis and fault injection.

TXOne reported on operational technology security and the most effective method for preventing production interruptions caused by cyber-attacks.

CrowdStrike and NVIDIA are collaborating to accelerate the use of analytics and AI in cybersecurity to help security teams combat modern cyberattacks, including AI-powered threats.

The National Institute of Standards and Technology (NIST) finalized its guidelines for protecting sensitive data, known as controlled unclassified information, aimed at organizations that do business with the federal government.

The Defense Advanced Research Projects Agency (DARPA) awarded BAE Systems a $12 million contract to solve thermal challenges limiting electronic warfare systems, particularly in GaN transistors.

Sigma Defense won a $4.7 million contract from the U.S. Army for an AI-powered virtual training environment, partnering with Brightline Interactive on a system that uses spatial computing and augmented intelligence workflows.

SkyWater’s advanced packaging operation in Florida has been accredited as a Category 1A Trusted Supplier by the Defense Microelectronics Activity (DMEA) of the U.S. Department of Defense (DoD).

Videos of two CWE-focused sessions from CVE/FIRST VulnCon 2024 were made available on the CWE YouTube Channel.

The Cybersecurity and Infrastructure Security Agency (CISA) issued a number of alerts/advisories.

Supercomputing

Supercomputers are battling for top dog.

The Frontier supercomputer at Oak Ridge National Laboratory (ORNL) retained the top spot on the Top500 list of the world’s fastest systems with an HPL score of 1.206 EFlop/s. The as-yet incomplete Aurora system at Argonne took second place, becoming the world’s second exascale system at 1.012 EFlop/s. The Green500 list, which tracks energy efficiency of compute, saw three new entrants take the top places.

Cerebras Systems, Sandia National Laboratories, Lawrence Livermore National Laboratory, and Los Alamos National Laboratory used Cerebras’ second generation Wafer Scale Engine to perform atomic scale molecular dynamics simulations at the millisecond scale, which they claim is 179X faster than the Frontier supercomputer.

UT Austin’s Stampede3 Supercomputer is now in full production, serving the open science community through 2029.

Education and Training

SEMI announced the SEMI University Semiconductor Certification Programs to help alleviate the workforce skills gap. Its first two online courses are designed for new talent seeking careers in the industry, and experienced workers looking to keep their skills current. Also, SEMI and other partners launched a European Chip Skills Academy Summer School in Italy.

Siemens created an industry credential program for engineering students that supplements a formal degree by validating industry knowledge and skills. Nonprofit agency ABET will provide accreditation. The first two courses are live at the University of Colorado Boulder (CU Boulder) and a series is planned with Pennsylvania State University (Penn State).

Syracuse University launched a $20 million Center for Advanced Semiconductor Manufacturing, with co-funding from Onondaga County.

Starting young is a good thing. An Arizona school district, along with the University of Arizona, is creating a semiconductor program for high schoolers.

Product News

Siemens and Sony partnered to enable immersive engineering via a spatial content creation system, NX Immersive Designer, which includes Sony’s XR head-mounted display. The integration of hardware and software gives designers and engineers natural ways to interact with a digital twin. Siemens also extended its Xcelerator as a Service portfolio with solutions for product engineering and lifecycle management, cloud-based high-performance simulation, and manufacturing operations management. It will be available on Microsoft Azure, as well.

Advantest announced the newest addition to its portfolio of power supplies for the V93000 EXA Scale SoC test platform. The DC Scale XHC32 power supply offers 32 channels with single-instrument total current of up to 640A.

Fig. 1: Advantest’s DC Scale XHC32. Source: Advantest

Infineon released its XENSIV TLE49SR angle sensors, which can withstand stray magnetic fields of up to 8 mT, making them suited to safety-critical automotive chassis applications.

Google debuted its sixth generation Cloud TPU, 4.7X faster and 67% more energy-efficient than the previous generation, with double the high-bandwidth memory.

X-Silicon uncorked a RISC-V vector CPU, coupled with a Vulkan-enabled GPU ISA and AI/ML acceleration in a single processor core, aimed at embedded and IoT applications.

IBM expanded its Qiskit quantum software stack, including the stable release of its SDK for building, optimizing, and visualizing quantum circuits.

Northeastern University announced the general availability of testing and integration solutions for Open RAN through the Open6G Open Testing and Integration Center (Open 6G OTIC).

Research

The University of Glasgow received £3 million (~$3.8M) from the Engineering and Physical Sciences Research Council (EPSRC)’s Strategic Equipment Grant scheme to help establish “Analogue,” an Automated Nano Analysing, Characterisation and Additive Packaging Suite to research silicon chip integration and packaging.

EPFL researchers developed scalable photonic ICs, based on lithium tantalate.

DISCO developed a way to increase the diameter of diamond wafers that uses the KABRA process, a laser ingot slicing method.

CEA-Leti developed two complementary approaches for high performance photon detectors — a mercury cadmium telluride-based avalanche photodetector and a superconducting single photon detector.

Toshiba demonstrated storage capacities of over 30TB with two next-gen large capacity recording technologies for hard disk drives (HDDs): Heat Assisted Magnetic Recording (HAMR) and Microwave Assisted Magnetic Recording (MAMR).

Caltech neuroscientists reported that their brain-machine interface (BMI) worked successfully in a second human patient, following 2022’s first instance, proving the device is not dependent on one particular brain or one location in a brain.

Linköping University researchers developed a cheap, sustainable battery made from zinc and lignin, while ORNL researchers developed carbon-capture batteries.

Events and Further Reading

Find upcoming chip industry events here, including:

Event	Date	Location
European Test Symposium	May 20 – 24	The Hague, Netherlands
NI Connect Austin 2024	May 20 – 22	Austin, Texas
ITF World 2024 (imec)	May 21 – 22	Antwerp, Belgium
Embedded Vision Summit	May 21 – 23	Santa Clara, CA
ASIP Virtual Seminar 2024	May 22	Online
Electronic Components and Technology Conference (ECTC) 2024	May 28 – 31	Denver, Colorado
Hardwear.io Security Trainings and Conference USA 2024	May 28 – Jun 1	Santa Clara, CA
SW Test	Jun 3 – 5	Carlsbad, CA
IITC2024: Interconnect Technology Conference	Jun 3 – 6	San Jose, CA
VOICE Developer Conference	Jun 3 – 5	La Jolla, CA
CHIPS R&D Standardization Readiness Level Workshop	Jun 4 – 5	Online and Boulder, CO

Find All Upcoming Events Here

Upcoming webinars are here.

Semiconductor Engineering’s latest newsletters:

Automotive, Security and Pervasive Computing
Systems and Design
Low Power-High Performance
Test, Measurement and Analytics
Manufacturing, Packaging and Materials

The post Chip Industry Week In Review appeared first on Semiconductor Engineering.

Efficient Electronics

Semiconductor Engineering

By: Andy Heinig

May 16, 2024 at 09:07

Attention has turned to the energy consumption of systems that run on electricity. At the moment, the discussion is focused on electricity consumption in data centers: if this continues to rise at its current rate, it will account for a significant proportion of global electricity consumption in the future. Yet there are other, less visible electricity consumers whose power needs are also constantly growing. One example is mobile communications, where ongoing expansion – especially with the current 5G standard and the future 6G standard – is pushing up the number of base stations required. This, too, will drive up electricity demand, which increases linearly with the number of stations unless the demand per base station is reduced. Another example is electronics for the management of household appliances and in the industrial sector: more and more such systems are being installed, and their electronics are becoming significantly more powerful. They are currently optimized for performance rather than for power consumption.

This state of affairs cannot continue into the future for two reasons: first, the price of electricity will continue to rise worldwide; and second, many companies are committed to becoming carbon neutral. The push for carbon neutrality in turn makes electricity more expensive and restricts the available supply much more severely. As a result, there will be a significant demand in the coming years for electronics that are efficient, particularly in their electricity consumption.

This development is already evident today, especially in power electronics, where the use of new semiconductor materials such as GaN or SiC has made it possible to reduce power consumption. A key driver for the development and introduction of such new materials was the electric car market, as reduced losses in the electronics lead directly to increased vehicle range. In the future, these materials will also find their way into other areas; for instance, they are already beginning to establish themselves in voltage transformers in various industries. However, this shift requires more factories and more suppliers for production, and further work also needs to be carried out to develop appropriate circuit concepts for these technologies.

In addition to the use of new materials, other concepts to reduce energy consumption are needed. The data center sector will require increasingly better-adapted circuits – ones that have been developed for a specific task, and as a result can perform this task much more efficiently than universal processors. This involves striking the optimum balance between universal architectures, such as microprocessors and graphics cards, and highly specialized architectures that are suitable for only one use case. Some products will also fall between these two extremes. The increased energy efficiency is then “purchased” through the effort and expense of developing highly specialized architectures. It’s important to note that the more specialized an architecture is, the smaller the market for it. That means such architectures will be economically viable only if they can be developed efficiently. This calls for new approaches that derive these architectures directly from high-level hardware/software optimization, without the additional implementation steps that are still necessary today. In sum, the approach depends on novel concepts and tools that generate circuits directly from a high-level description.

The post Efficient Electronics appeared first on Semiconductor Engineering.

How To Successfully Deploy GenAI On Edge Devices

Semiconductor Engineering

By: Gordon Cooper

May 16, 2024 at 09:06

Generative AI (GenAI) burst onto the scene and into the public’s imagination with the launch of ChatGPT in late 2022. Users were amazed at the natural language processing chatbot’s ability to turn a short text prompt into coherent humanlike text including essays, language translations, and code examples. Technology companies – impressed with ChatGPT’s abilities – have started looking for ways to improve their own products or customer experiences with this innovative technology. Since the ‘cost’ of adding GenAI includes a significant jump in computational complexity and power requirements versus previous AI models, can this class of AI algorithms be applied to practical edge device applications where power, performance and cost are critical? It depends.

What is GenAI?

A simple definition of GenAI is ‘a class of machine learning algorithms that can produce various types of content including human-like text and images.’ Early machine learning algorithms focused on detecting patterns in images, speech or text and then making predictions based on the data, for example, predicting the percentage likelihood that a certain image included a cat. GenAI algorithms take the next step – they perceive and learn patterns and then generate new patterns on demand by mimicking the original dataset. They generate a new image of a cat or describe a cat in detail.

While ChatGPT might be the most well-known GenAI algorithm, there are many available, with more being released on a regular basis. Two major types of GenAI algorithms are text-to-text generators – aka chatbots – like ChatGPT, GPT-4, and Llama-2, and text-to-image generative models like DALLE-2, Stable Diffusion, and Midjourney. You can see example prompts and their returned outputs of these two types of GenAI models in figure 1. Because one is text based and one is image based, these two types of outputs will demand different resources from edge devices attempting to implement these algorithms.

Fig. 1: Example GenAI outputs from a text-to-image generator (DALLE-2) and a text-to-text generator (ChatGPT).

Edge device applications for GenAI

Common GenAI use cases require connection to the internet and from there access to large server farms to compute the complex generative AI algorithms. However, for edge device applications, the entire dataset and neural processing engine must reside on the individual edge device. If the generative AI models can be run at the edge, there are potential use cases and benefits for applications in automobiles, cameras, smartphones, smart watches, virtual and augmented reality, IoT, and more.

Deploying GenAI on edge devices has significant advantages in scenarios where low latency, privacy or security concerns, or limited network connectivity are critical considerations.

Consider the possible application of GenAI in automotive applications. A vehicle is not always in range of a wireless signal, so GenAI needs to run with resources available on the edge. GenAI could be used for improving roadside assistance and converting a manual into an AI-enhanced interactive guide. In-car uses could include a GenAI-powered virtual voice assistant, improving the ability to set navigation, play music or send messages with your voice while driving. GenAI could also be used to personalize your in-cabin experience.

Other edge applications could benefit from generative AI. Augmented Reality (AR) edge devices could be enhanced by locally generating overlay computer-generated imagery and relying less heavily on cloud processing. While connected mobile devices can use generative AI for translation services, disconnected devices should be able to offer at least a portion of the same capabilities. Like our automotive example, voice assistant and interactive question-and-answer systems could benefit a range of edge devices.

While use cases for GenAI at the edge exist now, implementations must overcome the challenges of computational complexity and model size, as well as the limitations of power, area, and performance inherent in edge devices.

What technology is required to enable GenAI?

To understand GenAI’s architectural requirements, it is helpful to understand its building blocks. At the heart of GenAI’s rapid development are transformers, a relatively new type of neural network introduced in a Google Brain paper in 2017. Transformers have outperformed established AI models like Recurrent Neural Networks (RNNs) for natural language processing and Convolutional Neural Networks (CNNs) for images, video or other two- or three-dimensional data. A significant architectural improvement of a transformer model is its attention mechanism. Transformers can pay more attention to specific words or pixels than legacy AI models, drawing better inferences from the data. This allows transformers to better learn contextual relationships between words in a text string compared to RNNs and to better learn and express complex relationships in images compared to CNNs.
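The attention mechanism described above can be illustrated with a minimal sketch of scaled dot-product attention, the form commonly used in transformers. It is written in plain Python for readability; the function names and the toy vectors are assumptions made for illustration and do not come from any particular model or library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Each output is the attention-weighted average of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: one query that matches the first key more closely than the second.
outputs, weights = attention([[1.0, 0.0]],
                             [[1.0, 0.0], [0.0, 1.0]],
                             [[10.0, 0.0], [0.0, 10.0]])
print(weights[0], outputs[0])
```

The weights sum to one for each query, so a token (or pixel patch) that matches a key more closely contributes more of its value vector to the result.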

Fig. 2: Parameter sizes for various machine learning algorithms.

GenAI models are pre-trained on vast amounts of data, which allows them to better recognize and interpret human language or other types of complex data. The larger the datasets, the better the model can process human language, for instance. Compared to CNN or vision transformer machine learning models, GenAI algorithms have parameter counts – the number of pretrained weights or coefficients used in the neural network to identify patterns and create new ones – that are orders of magnitude larger. We can see in figure 2 that ResNet50 – a common CNN algorithm used for benchmarking – has 25 million parameters (or coefficients). Some transformers like BERT and Vision Transformer (ViT) have parameters in the hundreds of millions, while other transformers, like MobileViT, have been optimized to better fit in embedded and mobile applications. MobileViT is comparable to the CNN model MobileNet in parameters.

Compared to CNNs and vision transformers, ChatGPT requires 175 billion parameters and GPT-4 requires 1.75 trillion parameters. Even GPUs implemented in server farms struggle to execute these high-end large language models. How could an embedded neural processing unit (NPU) hope to process so many parameters given the limited memory resources of edge devices? The answer is that it cannot. However, there is a trend toward making GenAI more accessible in edge device applications, which have more limited computation resources. Some LLMs are tuned to reduce resource requirements by using a reduced parameter set. For example, Llama-2 is offered in a 70 billion parameter version, but smaller versions with fewer parameters have also been created. Llama-2 with seven billion parameters is still large, but it is within reach of a practical embedded NPU implementation.
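To make the parameter counts above more concrete, the following sketch estimates how much memory the weights alone would occupy. The 16-, 8-, and 4-bit precisions are assumptions chosen for illustration (the article does not state a precision), and the estimate ignores activations and any other runtime buffers.

```python
def weight_footprint_gb(num_params, bits_per_param):
    """Storage needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Parameter counts quoted in the article.
MODELS = {
    "ResNet50": 25e6,
    "Llama-2 7B": 7e9,
    "Llama-2 70B": 70e9,
    "175B LLM": 175e9,
}

for name, params in MODELS.items():
    # Assumed weight precisions, for illustration only.
    sizes = ", ".join(
        f"{bits}-bit: {weight_footprint_gb(params, bits):.2f} GB"
        for bits in (16, 8, 4)
    )
    print(f"{name:12s} {sizes}")
```

Under these assumptions, a seven billion parameter model needs several gigabytes for its weights, which is far more than on-chip SRAM can hold, so the weights have to live in external memory.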

There is no hard threshold for generative AI running on the edge. However, text-to-image generators like Stable Diffusion, with one billion parameters, can run comfortably on an NPU, and the expectation is for edge devices to run LLMs of up to six to seven billion parameters. MLCommons has added GPT-J, a six billion parameter GenAI model, to its MLPerf edge AI benchmark list.

Running GenAI on the edge

GenAI algorithms require a significant amount of data movement and computational complexity (with transformer support). The balance of those two requirements can determine whether a given architecture is compute-bound – not enough multiplication throughput for the data available – or memory-bound – not enough memory and/or bandwidth for all the multiplications required for processing. Text-to-image has a better mix of compute and bandwidth requirements – more computations needed for processing two-dimensional images and fewer parameters (in the one billion range). Large language models are more lopsided. There is less compute required, but a significantly larger amount of data movement. Even the smaller (6-7B parameter) LLMs are memory-bound.
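One common way to reason about the compute-bound versus memory-bound distinction is a roofline-style comparison: a workload's operations per byte moved are compared with the ratio of the hardware's peak compute to its memory bandwidth. The sketch below assumes a hypothetical accelerator (10 TOPS peak, 51 GB/s of memory bandwidth) and a simple model of LLM token generation in which every weight is read once per token. Both are illustrative assumptions, not measurements of any product.

```python
def classify(ops_per_byte, peak_ops_per_s, bandwidth_bytes_per_s):
    """Roofline-style check: compare workload intensity with the machine balance."""
    balance = peak_ops_per_s / bandwidth_bytes_per_s
    return "memory-bound" if ops_per_byte < balance else "compute-bound"

def llm_decode_intensity(bytes_per_param):
    """Ops per byte when each weight is read once per token (one multiply, one add)."""
    return 2.0 / bytes_per_param

# Hypothetical accelerator, for illustration only.
PEAK_OPS = 10e12   # 10 TOPS
BANDWIDTH = 51e9   # 51 GB/s

# LLM token generation with assumed 8-bit weights: very few ops per byte moved.
print("LLM decode:", classify(llm_decode_intensity(1), PEAK_OPS, BANDWIDTH))

# A workload that reuses each byte many times (assumed 500 ops per byte).
print("High-reuse workload:", classify(500.0, PEAK_OPS, BANDWIDTH))
```

With these numbers the machine balance is roughly 196 operations per byte, so a workload that performs only about two operations per byte of weights sits far on the memory-bound side.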

The obvious solution is to choose the fastest memory interface available. From figure 3, you can see that a memory typically used in edge devices, LPDDR5, has a bandwidth of 51 GB/s, while HBM2E can support up to 461 GB/s. This does not, however, take into consideration the power-down benefits of LPDDR memory over HBM. While HBM interfaces are often used in high-end server-type AI implementations, LPDDR is almost exclusively used in power-sensitive applications because of its power-down abilities.
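Bandwidth figures of this size can be reproduced from the per-pin data rate and the interface width, and they translate directly into an upper bound on token rate for a memory-bound LLM. In the sketch below, the pin rates and bus widths (6.4 Gbit/s on a 64-bit LPDDR5 interface, 3.6 Gbit/s on a 1024-bit HBM2E stack) and the 8-bit weight precision are assumptions chosen for illustration. The result is expressed in gigabytes per second.

```python
def peak_bandwidth_gb_s(pin_rate_gbit_s, bus_width_bits):
    """Peak interface bandwidth in GB/s: per-pin rate times width, bits to bytes."""
    return pin_rate_gbit_s * bus_width_bits / 8

def max_tokens_per_s(bandwidth_gb_s, weights_gb):
    """Upper bound on decode rate if every weight is streamed once per token."""
    return bandwidth_gb_s / weights_gb

# Assumed configurations (not taken from the article).
lpddr5 = peak_bandwidth_gb_s(6.4, 64)    # 6.4 Gbit/s per pin, 64-bit interface
hbm2e = peak_bandwidth_gb_s(3.6, 1024)   # 3.6 Gbit/s per pin, 1024-bit stack

weights_gb = 7.0  # a 7 billion parameter model at an assumed 8 bits per weight
for name, bw in (("LPDDR5", lpddr5), ("HBM2E", hbm2e)):
    print(f"{name}: {bw:.1f} GB/s, at most {max_tokens_per_s(bw, weights_gb):.1f} tokens/s")
```

This bound ignores on-chip caching, compression, and compute time, so real systems will land below it. It is meant only to show why memory bandwidth dominates LLM performance at the edge.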

Fig. 3: The bandwidth and power difference between LPDDR and HBM.

Using LPDDR memory interfaces automatically limits the maximum data bandwidth to well below what is achievable with an HBM memory interface. That means edge applications will have less bandwidth for GenAI algorithms than an NPU or GPU used in a server application. One way to address bandwidth limitations is to increase the amount of on-chip L2 memory. However, this impacts area and, therefore, silicon cost. While embedded NPUs often implement hardware and software techniques to reduce bandwidth requirements, these will not allow LPDDR to approach HBM bandwidths. The embedded AI engine will be limited to the amount of LPDDR bandwidth available.

Implementation of GenAI on an NPX6 NPU IP

The Synopsys ARC NPX6 NPU IP family is based on a sixth-generation neural network architecture designed to support a range of machine learning models including CNNs and transformers. The NPX6 family is scalable with a configurable number of cores, each with its own independent matrix multiplication engine, generic tensor accelerator (GTA), and dedicated direct memory access (DMA) units for streamlined data processing. The NPX6 can scale from applications requiring less than one TOPS of performance to those requiring thousands of TOPS, using the same development tools to maximize software reuse.

The matrix multiplication engine, GTA, and DMA have all been optimized for supporting transformers, which allows the ARC NPX6 to support GenAI algorithms. Each core’s GTA is expressly designed and optimized to efficiently perform nonlinear functions, such as ReLU, GELU, and sigmoid. These are implemented using a flexible lookup table approach to anticipate future nonlinear functions. The GTA also supports other critical operations, including SoftMax and L2 normalization needed in transformers. Complementing this, the matrix multiplication engine within each core can perform 4,096 multiplications per cycle. Because GenAI is based on transformers, there are no computation limitations for running GenAI on the NPX6 processor.
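The general idea of evaluating nonlinear functions through a lookup table can be sketched as follows. This is a generic software illustration with linear interpolation between table entries, not a description of how the NPX6 GTA is implemented. The table size, input range, and function names are assumptions.

```python
import math

def build_lut(fn, lo, hi, entries):
    """Sample fn at evenly spaced points in [lo, hi]; returns (table, lo, step)."""
    step = (hi - lo) / (entries - 1)
    return [fn(lo + i * step) for i in range(entries)], lo, step

def lut_eval(lut, x):
    """Approximate fn(x) by linear interpolation between neighbouring entries."""
    table, lo, step = lut
    # Clamp to the table range. This suits saturating functions such as sigmoid;
    # a function like GELU would need its tails handled separately.
    pos = min(max((x - lo) / step, 0.0), len(table) - 1)
    idx = min(int(pos), len(table) - 2)
    frac = pos - idx
    return table[idx] + frac * (table[idx + 1] - table[idx])

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Assumed table: 256 entries covering [-8, 8].
sigmoid_lut = build_lut(sigmoid, -8.0, 8.0, 256)
gelu_lut = build_lut(gelu, -8.0, 8.0, 256)
print(lut_eval(sigmoid_lut, 0.5), sigmoid(0.5))
print(lut_eval(gelu_lut, 0.5), gelu(0.5))
```

The appeal of this style of design is that supporting a new activation function means loading a different table rather than changing the datapath, which is consistent with the flexibility the article describes.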

Efficient NPU design for transformer-based models like GenAI requires complex multi-level memory management. The ARC NPX6 processor has a flexible memory hierarchy and can support a scalable L2 memory of up to 64 MB of on-chip SRAM. Furthermore, each NPX6 core is equipped with independent DMAs dedicated to the tasks of fetching feature maps and coefficients and writing new feature maps. This segregation of tasks allows for an efficient, pipelined data flow that minimizes bottlenecks and maximizes the processing throughput. The family also has a range of bandwidth reduction techniques in hardware and software to make the most of the available bandwidth.

In an embedded GenAI application, the ARC NPX6 family will be limited only by the LPDDR bandwidth available in the system. The NPX6 successfully runs Stable Diffusion (text-to-image) and Llama-2 7B (text-to-text) GenAI algorithms, with efficiency dependent on system bandwidth and the use of on-chip SRAM. While larger GenAI models could run on the NPX6, they will be slower – measured in tokens per second – than server implementations. Learn more at www.synopsys.com/npx

The post How To Successfully Deploy GenAI On Edge Devices appeared first on Semiconductor Engineering.

Will Domain-Specific ICs Become Ubiquitous?

Semiconductor Engineering

By: Brian Bailey

May 16, 2024 at 09:05

Questions are surfacing for all types of design, ranging from small microcontrollers to leading-edge chips, over whether domain-specific design will become ubiquitous, or whether it will fall into the historic pattern of customization first, followed by lower-cost, general-purpose components.

Custom hardware always has been a double-edged sword. It can provide a competitive edge for chipmakers, but often requires more time to design, verify, and manufacture a chip, which can sometimes cost a market window. In addition, it’s often too expensive for all but the most price-resilient applications. This is a well-understood equation at the leading edge of design, particularly where new technologies such as generative AI are involved.

But with planar scaling coming to an end, and with more features tailored to specific domains, the chip industry is struggling to figure out whether the business/technical equation is undergoing a fundamental and more permanent change. This is muddied further by the fact that some 30% to 35% of all design tools today are being sold to large systems companies for chips that will never be sold commercially. In those applications, the collective savings from improved performance per watt may dwarf the cost of designing, verifying, and manufacturing a highly optimized multi-chip/multi-chiplet package across a large data center, leaving the debate about custom vs. general-purpose more uncertain than ever.

“If you go high enough in the engineering organization, you’re going to find that what people really want to do is a software-defined whatever it is,” says Russell Klein, program director for high-level synthesis at Siemens EDA. “What they really want to do is buy off-the-shelf hardware, put some software on it, make that their value-add, and ship that. That paradigm is breaking down in a number of domains. It is breaking down where we need either extremely high performance, or we need extreme efficiency. If we need higher performance than we can get from that off-the-shelf system, or we need greater efficiency, we need the battery to last longer, or we just can’t burn as much power, then we’ve got to start customizing the hardware.”

Even the selection of processing units can make a solution custom. “Domain-specific computing is already ubiquitous,” says Dave Fick, CEO and cofounder of Mythic. “Modern computers, whether in a laptop, phone, security camera, or in farm equipment, consist of a mix of hardware blocks co-optimized with software. For instance, it is common for a computer to have video encode or decode hardware units to allow a system to connect to a camera efficiently. It is common to have accelerators for encryption so that we can safely communicate. Each of these is co-optimized with software algorithms to make commonly used functions highly efficient and flexible.”

Steve Roddy, chief marketing officer at Quadric, agrees. “Heterogeneous processing in SoCs has been de rigueur in the vast majority of consumer applications for the past two decades or more. SoCs for mobile phones, tablets, televisions, and automotive applications have long been required to meet a grueling combination of high-performance plus low-cost requirements, which has led to the proliferation of function-specific processors found in those systems today. Even low-cost SoCs for mobile phones today have CPUs for running Android, complex GPUs to paint the display screen, audio DSPs for offloading audio playback in a low-power mode, video DSPs paired with NPUs in the camera subsystem to improve image capture (stabilization, filters, enhancement), baseband DSPs — often with attached NPUs — for high speed communications channel processing in the Wi-Fi and 5G subsystems, sensor hub fusion DSPs, and even power-management processors that maximize battery life.”

It helps to separate what you call general-purpose and what is application-specific. “There is so much benefit to be had from running your software on dedicated hardware, what we call bespoke silicon, because it gives you an advantage over your competitors,” says Marc Swinnen, director of product marketing in Ansys’ Semiconductor Division. “Your software runs faster, lower power, and is designed to run specifically what you want to run. It’s hard for a competitor with off-the-shelf hardware to compete with you. Silicon has become so central to the business value, the business model, of many companies that it has become important to have that optimized.”

There is a balance, however. “If there is any cost justification in terms of return on investment and deployment costs, power costs, thermal costs, cooling costs, then it always makes sense to build a custom ASIC,” says Sharad Chole, chief scientist and co-founder of Expedera. “We saw that for cryptocurrency, we see that right now for AI. We saw that for edge computing, which requires extremely ultra-low power sensors and ultra-low power processes. But there also has been a push for general-purpose computing hardware, because then you can easily make the applications more abstract and scalable.”

Part of the seeming conflict is due to the scope of specificity. “When you look at the architecture, it’s really the scope that determines the application specificity,” says Frank Schirrmeister, vice president of solutions and business development at Arteris. “Domain-specific computing is ubiquitous now. The important part is the constant moving up of the domain specificity to something more complex — from the original IP, to configurable IP, to subsystems that are configurable.”

In the past, it has been driven more by economics. “There’s an ebb and a flow to it,” says Paul Karazuba, vice president of marketing at Expedera. “There’s an ebb and a flow to putting everything into a processor. There’s an ebb and a flow to having co-processors, augmenting functions that are inside of that main processor. It’s a natural evolution of pretty much everything. It may not necessarily be cheaper to design your own silicon, but it may be more expensive in the long run to not design your own silicon.”

An attempt to formalize that ebb and flow was made by Tsugio Makimoto in the 1990s, when he was Sony’s CTO. He observed that electronics cycled between custom solutions and programmable ones approximately every 10 years. What’s changed is that most custom chips from the time of his observation contained highly programmable standard components.

Technology drivers
Today, it would appear that technical issues will decide this. “The industry has managed to work around power issues and push up the thermal envelope beyond points I personally thought were going to be reasonable, or feasible,” says Elad Alon, co-founder and CEO of Blue Cheetah. “We’re hitting that power limit, and when you hit the power limit it drives you toward customization wherever you can do it. But obviously, there is tension between flexibility, scalability, and applicability to the broadest market possible. This is seen in the fast pace of innovation in the AI software world, where tomorrow there could be an entirely different algorithm, and that throws out almost all the customizations one may have done.”

The slowing of Moore’s Law will have a fundamental influence on the balance point. “There have been a number of bespoke silicon companies in the past that were successful for a short period of time, but then failed,” says Ansys’ Swinnen. “They had made some kind of advance, be it architectural or addressing a new market need, but then the general-purpose chips caught up. That is because there’s so much investment in them, and there’s so many people using them, there’s an entire army of people advancing, versus your company, just your team, that’s advancing your bespoke solution. Inevitably, sooner or later, they bypass you and the general-purpose hardware just gets better than the specific one. Right now, the pendulum has swung toward custom solutions being the winner.”

However, general-purpose processors do not automatically advance if companies don’t keep up with adoption of the latest nodes, and that leads to even more opportunities. “When adding accelerators to a general-purpose processor starts to break down, because you want to go faster or become more efficient, you start to create truly customized implementations,” says Siemens’ Klein. “That’s where high-level synthesis starts to become really interesting, because you’ve got that software-defined implementation as your starting point. We can take it through high-level synthesis (HLS) and build an accelerator that’s going to do that one specific thing. We could leave a bunch of registers to define its behavior, or we can just hard code everything. The less general that system is, the more specific it is, usually the higher performance and the greater efficiency that we’re going to take away from it. And it almost always is going to be able to beat a general-purpose accelerator or certainly a general-purpose processor in terms of both performance and efficiency.”

At the same time, IP has become massively configurable. “There used to be IP as the building blocks,” says Arteris’ Schirrmeister. “Since then, the industry has produced much larger and more complex IP that takes on the role of sub-systems, and that’s where scope comes in. We have seen Arm with what they call the compute sub-systems (CSS), which are an integration and then hardened. People care about the chip as a whole, and then the chip and the system context with all that software. Application specificity has become ubiquitous in the IP space. You either build hard cores, you use a configurable core, or you use high-level synthesis. All of them are, by definition, application-specific, and the configurability plays in there.”

Put in perspective, there is more than one way to build a device, and an increasing number of options for getting it done. “There’s a really large market for specialized computing around some algorithm,” says Klein. “IP for that is going to be both in the form of discrete chips, as well as IP that could be built into something. Ultimately, that has to become silicon. It’s got to be hardened to some degree. They can set some parameters and bake it into somebody’s design. Consider an Arm processor. I can configure how many CPUs I want, I can configure how big I want the caches, and then I can go bake that into a specific implementation. That’s going to be the thing that I build, and it’s going to be more targeted. It will have better efficiency and a better cost profile and a better power profile for the thing that I’m doing. Somebody else can take it and configure it a little bit differently. And to the degree that the IP works, that’s a great solution. But there will always be algorithms that don’t have a big enough market for IP to address. And that’s where you go in and do the extreme customization.”

Chiplets
Some have questioned if the emerging chiplet industry will reverse this trend. “We will continue to see systems composed of many hardware accelerator blocks, and advanced silicon integration technologies (i.e., 3D stacking and chiplets) will make that even easier,” says Mythic’s Fick. “There are many companies working on open standards for chiplets, enabling communication bandwidth and energy efficiency that is an order of magnitude greater than what can be built on a PCB. Perhaps soon, the advanced system-in-package will overtake the PCB as the way systems are designed.”

Chiplets are not likely to be highly configurable. “Configuration in the chiplet world might become just a function of switching off things you don’t need,” says Schirrmeister. “Configuration really means that you do not use certain things. You don’t get your money back for those items. It’s all basically applying math and predicting what your volumes are going to be. If it’s an incremental cost that has one more block on it to support another interface, or making the block the Ethernet block with time triggered stuff in it for automotive, that gives you an incremental effort of X. Now, you have to basically estimate whether it also gives you a multiple of that incremental effort as incremental profit. It works out this way because chips just become very configurable. Chiplets are just going in the direction or finding the balance of more generic usage so that you can apply them in more chiplet designs.”

The chiplet market is far from certain today. “The promise of chiplets is that you use only the function that you want from the supplier that you want, in the right node, at the right location,” says Expedera’s Karazuba. “The idea of specialization and chiplets are at arm’s length. They’re actually together, but chiplets have a long way to go. There’s still not that universal agreement of the different things around a chiplet that have to be in order to make the product truly mass market.”

While chiplets have been proven to work, nearly all of the chiplets in use today are proprietary. “To build a viable [commercial] chiplet company, you have to be going after a broad enough market, large enough from a dollar perspective, then you can make all the investment, have success and get everything back accordingly,” says Blue Cheetah’s Alon. “There’s a similar tension where people would like to build a general-purpose chiplet that can be used anywhere, by anyone. That is the plug-and-play discussion, but you could finish up with something that becomes so general-purpose, with so much overhead, that it’s just not attractive in any particular market. In the chiplet case, for technical reasons, it might not actually really work that way at all. You might try to build it for general purpose, and it turns out later that it doesn’t plug into particular sockets that are of interest.”

The economics of chiplet viability have not yet been defined. “The thing about chiplets is they can be small,” says Klein. “Being small means that we don’t need as big a market for them as we would for a very large chip. We can also build them on different technologies. We can have some that are on older technologies, where transistors are cheaper, and we can combine those with other chiplets that might be leading-edge nodes where we could have general-purpose CPUs or NPU accelerators. There’s a mix-and-match, and we can do chiplets smaller than we can general-purpose chips. We can do smaller runs of them. We can take that IP and customize it for a particular market vertical and create some chiplets for that, change the configuration a bit, and do another run for something else. There’s a level of customization that can be deployed and supported by the market that’s a little bit more than we’ve seen in full-size chips, where the entire thing has to be built into one package.”

Conclusion
What it means for a design to be general-purpose or custom is changing. All designs will contain some of each. Some companies will develop novel architectures using general-purpose processors, and these will be better than a fully general-purpose solution. Others will create highly customized hardware for functions that are known to be stable, and use general-purpose hardware for things that are likely to change. One thing has never changed, however. A company is not likely to add more customization than necessary to satisfy the needs of the market it is targeting.

The post Will Domain-Specific ICs Become Ubiquitous? appeared first on Semiconductor Engineering.

DDR5 PMICs Enable Smarter, Power-Efficient Memory Modules

Semiconductor Engineering

By: Tim Messegee

May 16, 2024 at 09:05

Power management has received increasing focus in microelectronic systems as the need for greater power density, efficiency, and precision has grown apace. One of the important ongoing trends in service of these needs has been the move to localizing power delivery. To optimize system power, it’s best to deliver as high a voltage as possible to the endpoint where the power is consumed. Then at the endpoint, that incoming high voltage can be regulated into the lower voltages with higher currents required by the endpoint components.

We saw this same trend play out in the architecting of the DDR5 generation of computer main memory. In planning for DDR5, the industry laid out ambitious goals for memory bandwidth and capacity. Concurrently, the aim was to maintain power within the same envelope as DDR4 on a per module basis. In order to achieve these goals, DDR5 required a smarter DIMM architecture; one that would embed more intelligence in the DIMM and increase its power efficiency. One of the biggest architectural changes of this smarter DIMM architecture was moving power management from the motherboard to an on-module Power Management IC (PMIC) on each DDR5 RDIMM.

In previous DDR generations, the power regulator on the motherboard had to deliver a low voltage at high current across the motherboard, through a connector and then onto the DIMM. As supply voltages were reduced over time (to maintain power levels at higher data rates), it was a growing challenge to maintain the desired voltage level because of IR drop. By implementing a PMIC on the DDR5 RDIMM, the problem with IR drop was essentially eliminated.
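The benefit of delivering power at a higher voltage follows from basic circuit arithmetic: for a fixed power, current is I = P / V, the voltage lost along the path is I × R, and the power dissipated in the path is I² × R. The sketch below works through that arithmetic. The 10 W module power, the 5 mΩ path resistance, and the 1.1 V and 12 V delivery voltages are assumptions picked for illustration, and the calculation ignores the conversion efficiency of the on-module regulator.

```python
def delivery_loss(power_w: float, supply_v: float, path_resistance_ohm: float):
    """Return (current in A, IR drop in V, I^2*R loss in W) for a delivery path."""
    current_a = power_w / supply_v
    ir_drop_v = current_a * path_resistance_ohm
    loss_w = current_a ** 2 * path_resistance_ohm
    return current_a, ir_drop_v, loss_w


POWER_W = 10.0        # assumed module power
PATH_R_OHM = 0.005    # assumed board-plus-connector resistance

# Compare low-voltage delivery with high-voltage delivery (both assumed values).
for label, volts in [("low-voltage delivery", 1.1), ("high-voltage delivery", 12.0)]:
    amps, drop, loss = delivery_loss(POWER_W, volts, PATH_R_OHM)
    print(f"{label} at {volts} V: {amps:.2f} A, "
          f"drop {drop * 1000:.1f} mV ({drop / volts:.2%} of supply), "
          f"loss {loss * 1000:.1f} mW")
```

Because the path loss scales with the square of the current, raising the delivery voltage by a given factor cuts the loss by the square of that factor, and the drop becomes a far smaller fraction of the supply.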

In addition, the on-DIMM PMIC allows for very fine-grained control of the voltage levels supplied to the various components on the DIMM. As such, DIMM suppliers can really dial in the best power levels for the performance target of a particular DIMM configuration. On-DIMM PMICs also offer an economic benefit. Power management on the motherboard meant the regulator had to be designed to support a system with fully populated DIMMs. With on-DIMM PMICs, you pay only for the power management capacity you need to support your specific system memory configuration.

The upshot is that power management has become a major enabler of increasing memory performance. Advancing memory performance has been the mission of Rambus for nearly 35 years. We’re intimate with memory subsystem design on modules, with expertise across many critical enabling technologies, and have demonstrated the disciplines required to successfully develop chips for the challenging module environment with its increased power density, space constraints and complex thermal management challenges.

As part of the development of our DDR5 memory interface chipset, Rambus built a world-class power management team and has now introduced a new family of DDR5 server PMICs. This new server PMIC product family lays the foundation for a roadmap of future power management chips. As AI continues to expand from training to inference, increasing demands on memory performance will extend beyond servers to client systems and drive the need for new PMIC solutions tailored for emerging use cases and form factors across the computing landscape.

The post DDR5 PMICs Enable Smarter, Power-Efficient Memory Modules appeared first on Semiconductor Engineering.

How Quickly Can You Take Your Idea To Chip Design?

Semiconductor Engineering

By: Kira Jones

May 16, 2024 at 09:04

Gone are the days when expensive tapeouts were done only by commercial companies. Thanks to Tiny Tapeout, students, hobbyists, and more can create a simple ASIC or PCB design and actually send it to a foundry for a small fraction of the usual cost. Learners from all walks of life can use the resources to learn how to design a chip, without signing an NDA or installing licenses, faster than ever before. Whether you’re a digital, analog, or mixed-signal designer, there’s support for you.

We’re excited to support our academic network in participating in this initiative to gain more hands-on experience that will prepare them for a career in the semiconductor industry. We have professors incorporating it into the classroom, giving students the exciting opportunity to take their ideas from concept to reality.

“It gives people this joy when we design the chip in class. The 50 students that took the class last year, they designed a chip and Google funded it, and every time they got their design on the chip, their eyes got really big. I love being able to help students do that, and I want to do that all over the country,” said Matt Morrison, associate teaching professor in computer science and engineering, University of Notre Dame.

We also advise and encourage extracurricular design teams to challenge themselves outside the classroom. We partner with multiple design teams focused on creating an environment for students to explore the design flow process from RTL-to-GDS, and Tiny Tapeout provides an avenue for them.

“Just last year, SiliconJackets was founded by Zachary Ellis and me as a Georgia Tech club that takes ideas to SoC tapeout. Today, I am excited to share that we submitted the club’s first-ever design to Tiny Tapeout 6. This would not have been possible without the help from our advisors, and industry partners at Apple and Cadence,” said Nealson Li, SiliconJackets vice president and co-founder.

Tiny Tapeout also creates a culture of knowledge sharing, allowing participants to share their designs online, collaborate with one another, and build off an existing design. This creates a unique opportunity to learn from others’ experiences, enabling faster learning and more exposure.

“One of my favorite things about this project is that you’re not only going to get your design, but everybody else’s as well. You’ll be able to look through the chips’ data sheet and try out someone else’s design. In our previous runs, we’ve seen some really interesting designs, including RISC-V CPUs, FPGAs, ring oscillators, synthesizers, USB devices, and loads more,” said Matt Venn, science & technology communicator and electronic engineer.

Tiny Tapeout is on its seventh run, which launched on April 22, 2024, and will remain open until June 1, 2024, or until all the slots fill up! With each run, more unique designs are created, more knowledge is shared, and more of the future workforce is developed. Check out the designs that were just submitted for Tiny Tapeout 6.

What can you expect when you participate?

Access to training materials

Ability to create your own design using one of the templates

Support from the FAQs or Tiny Tapeout community

Interested in learning more? Check out their webpage. Want to use Cadence tools for your design? Check out our University Program and what tools students can access for free. We can’t wait to see what you come up with and how it’ll help you launch a career in the electronics industry!

The post How Quickly Can You Take Your Idea To Chip Design? appeared first on Semiconductor Engineering.