
Turbocharging Cost-Conscious SoCs With Cache

By John Min

Some design teams creating system-on-chip (SoC) devices are fortunate to work with the latest and greatest technology nodes coupled with a largely unconstrained budget for acquiring intellectual property (IP) blocks from trusted third-party vendors. However, many engineers are not so privileged. For every “spare no expense” project, there are a thousand “do the best you can with a limited budget” counterparts.

One way to squeeze the most performance out of lower-cost, earlier-generation, mid-range processor and accelerator cores is the judicious application of caches.

Cutting costs

A simplified example of a typical cost-conscious SoC scenario is illustrated in figure 1. Although the SoC may be composed of many IPs, only three are shown here for clarity.

Fig. 1: Portion of a cost-conscious, non-cache-coherent SoC. (Source: Arteris)

The predominant technology for connecting the IPs inside an SoC is network-on-chip (NoC) interconnect IP. This may be thought of as an IP that spans the entire device. The example shown in figure 1 may be assumed to reflect a non-cache-coherent scenario. In this case, any coherency requirements will be handled in software.

Let’s assume the SoC’s clock is running at 1GHz. Suppose a central processing unit (CPU) based on a reduced instruction set computer (RISC) architecture consumes a single clock cycle for a typical instruction. However, access to external DRAM memory can take anywhere between 100 and 200 processor clock cycles (we’ll average this out to 150 cycles for the purposes of this article). This means that if the CPU lacked a Level 1 (L1) cache and were connected directly to the DRAM via the NoC and DDR memory controller, each instruction would consume 150 processor clock cycles, resulting in a CPU utilization of only 1/150 = 0.67%.

This is why CPUs, along with some accelerators and other IPs, employ cache memories to increase processor utilization and application performance. The cache concept rests on the principle of locality: only a small portion of the main memory is in use at any given time, and locations in that portion are accessed multiple times. Mainly due to loops, nested loops and subroutines, instructions and their associated data exhibit temporal, spatial and sequential locality. This means that once a block of instructions and data has been copied from the main memory into an IP’s cache, the IP will typically access it repeatedly.

Today’s high-end CPU IPs usually have a minimum of a Level 1 (L1) and Level 2 (L2) cache, and they often have a Level 3 (L3) cache. Also, some accelerator IPs, like graphics processing units (GPUs), often have their own internal caches. However, these latest-generation high-end IPs often cost 5X to 10X as much as their previous-generation mid-range counterparts. As a result, as illustrated in figure 1, the CPU in a cost-conscious SoC may come equipped with only an L1 cache.

Let’s consider the CPU and its L1 cache in a little more depth. When the CPU requests data that is present in its cache, the result is called a cache hit. Since the L1 cache typically runs at the same speed as the processor core, a cache hit will be processed in a single processor clock cycle. By comparison, if the requested data is not in the cache, the result, called a cache miss, will require access to the main memory, which will consume 150 processor clock cycles.

Now consider running 1,000,000 instructions. If the cache were large enough to contain the whole program, then this would consume only 1,000,000 clock cycles, resulting in a CPU efficiency of 1,000,000 instructions/1,000,000 clock cycles = 100%.

Unfortunately, the L1 cache in a mid-range CPU will typically be only 16KB to 64KB in size. If we assume a 95% cache hit rate, then 950,000 of our 1,000,000 instructions will take one processor clock cycle. The remaining 50,000 instructions will each consume 150 clock cycles. Thus, the CPU efficiency in this case can be calculated as 1,000,000/((950,000 * 1) + (50,000 * 150)) = ~12%.
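
To make the arithmetic easy to check, the calculation generalizes to a simple efficiency model. Here is a minimal Python sketch (the function name and parameters are ours, purely for illustration) that reproduces the numbers above:

    # CPU efficiency = useful instructions / total clock cycles consumed.
    # hit_cycles and miss_cycles are measured in processor clock cycles.
    def cpu_efficiency(hit_rate, hit_cycles=1, miss_cycles=150):
        avg_cycles_per_instruction = (hit_rate * hit_cycles
                                      + (1 - hit_rate) * miss_cycles)
        return 1 / avg_cycles_per_instruction

    print(cpu_efficiency(0.00))  # no cache: 1/150 = ~0.0067 (0.67%)
    print(cpu_efficiency(1.00))  # whole program cached: 1.0 (100%)
    print(cpu_efficiency(0.95))  # 95% hit rate: ~0.118 (~12%)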

Turbocharging performance

A cost-effective way of turbocharging the performance of a cost-conscious SoC is to add cache IPs. For example, CodaCache from Arteris is a configurable, standalone non-coherent cache IP. Each CodaCache instance can be up to 8MB in size, and multiple copies can be instantiated in the same SoC, as demonstrated in figure 2.

Fig. 2: Portion of a turbocharged, non-cache-coherent SoC. (Source: Arteris)

It is not the intention of this article to suggest that every IP should be equipped with a CodaCache. Figure 2 is intended only to provide examples of potential CodaCache deployments.

If a CodaCache instance is associated with an IP, it’s known as a dedicated cache (DC). Alternatively, if a CodaCache instance is associated with a DDR memory controller, it’s referred to as a last-level cache (LLC). A DC will accelerate the performance of the IP with which it is associated, while an LLC will enhance the performance of the entire SoC.

As an example of the type of performance boost we might expect, consider the CPU shown in figure 2. Let’s assume the CodaCache DC instance associated with this IP is running at half the processor speed and that any accesses to this cache consume 20 processor clock cycles. If we also assume a 95% cache hit rate for this DC, then—for 1,000,000 instructions—our overall CPU+L1+DC efficiency can be calculated as 1,000,000/((950,000 * 1) + (47,500 * 20) + (2,500 * 150)) = ~44%. That’s a performance boost of ~273%!
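
Extending the simple model from earlier to this two-level hierarchy (using the same assumed hit rates and latencies, which are illustrative rather than measured) reproduces the ~44% figure and the resulting speedup:

    # Two-level model: L1 hit (1 cycle), L1 miss that hits the DC
    # (20 cycles), and DC miss that goes out to DRAM (150 cycles).
    def cpu_efficiency_l1_dc(l1_hit=0.95, dc_hit=0.95,
                             l1_cycles=1, dc_cycles=20, dram_cycles=150):
        avg_cycles = (l1_hit * l1_cycles
                      + (1 - l1_hit) * dc_hit * dc_cycles
                      + (1 - l1_hit) * (1 - dc_hit) * dram_cycles)
        return 1 / avg_cycles

    l1_only = 1 / (0.95 * 1 + 0.05 * 150)   # ~0.118 (~12%), as before
    l1_plus_dc = cpu_efficiency_l1_dc()     # ~0.440 (~44%)
    print(l1_plus_dc / l1_only)             # ~3.7x speedup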

Conclusion

In the past, embedded programmers relished the challenge of squeezing the highest performance possible out of small processors with low clock speeds and limited memory resources. In fact, it was common for computer magazines to issue challenges to their readers along the lines of, “Who can perform task X on processor Y in the minimum number of clock cycles using the smallest amount of memory?”

Today, many SoC developers enjoy the challenge of squeezing the highest performance possible out of their designs, especially if they are constrained to use lower-performing mid-range IPs. Deploying CodaCache IPs as dedicated and last-level caches provides an affordable way for engineers to turbocharge their cost-conscious SoCs. To learn more about CodaCache from Arteris, visit arteris.com.

NoC Development – Make Or Buy?

In the selection and qualification process for semiconductor IP, design teams often consider the cost of in-house development. Network-on-Chip (NoC) IP is no different. In “When Does My SoC Design Need A NoC?” Michael Frank and I argued that most of today’s designs – even less complex ones – can benefit from NoCs. In the blog “Balancing Memory And Coherence: Navigating Modern Chip Architectures,” I discussed the complexity that coherency adds to on-chip interconnect. After describing some of the steps of NoC development based on what ChatGPT 3.5 recommended in “Shortening Network-On-Chip Development Schedules Using Physical Awareness,” it’s time to look in more detail at the development effort that design teams would have to invest to develop coherent NoCs from scratch.

ChatGPT, here we go again!

The prompt “Tell me how to develop an optimized network-on-chip for semiconductor design, considering the aspects of cache coherency” gives an excellent starting point in ChatGPT 4.0.

Understanding Protocols: First, one needs to understand cache coherency protocols. The recommendation is to study existing protocols before selecting one – specifically, protocols like MESI (Modified, Exclusive, Shared, Invalid), MOESI (Modified, Owned, Exclusive, Shared, Invalid), and directory-based protocols. Analyze their strengths and weaknesses in terms of scalability, latency, and bandwidth requirements. Then, choose a protocol that aligns with your performance goals and the scale of your NoC. Directory-based protocols are often preferred for larger-scale systems due to their scalability.

ChatGPT’s recommendation for the first step is a good start. I previously discussed the complexity of specific protocols like AMBA AXI, APB, ACE, CHI, OCP, CXL, and TileLink in “Design Complexity In The Golden Age Of Semiconductors.” One must read several thousand pages of documentation to understand the options here. And – by the way – these interface protocols are orthogonal to the MESI/MOESI commentary from ChatGPT above; the coherency states are implementation choices. In a practical scenario, many of these aspects depend on the building blocks the design team wants to license, like processors from the Arm, RISC-V, Arc, Tensilica, CEVA, and other ecosystems, as well as the protocol support in design IP blocks (think PCIe, UCIe, LPDDR) and accelerators for AI/ML.
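
To make the MESI state machine concrete, here is a deliberately simplified Python sketch of the states for a single cache line (writebacks, transient states, and many corner cases are omitted; this illustrates the concept, not any vendor’s implementation):

    from enum import Enum

    class MESI(Enum):
        MODIFIED = "M"   # dirty; this cache holds the only valid copy
        EXCLUSIVE = "E"  # clean; this cache holds the only copy
        SHARED = "S"     # clean; other caches may also hold copies
        INVALID = "I"    # line not present in this cache

    def on_local_read(state, others_have_copy):
        # A read miss fills the line: Exclusive if no other cache
        # holds it, Shared otherwise. Hits leave the state unchanged.
        if state == MESI.INVALID:
            return MESI.SHARED if others_have_copy else MESI.EXCLUSIVE
        return state

    def on_local_write(state):
        # Any local write ends in Modified; from Shared or Invalid this
        # also requires invalidating copies in other caches first.
        return MESI.MODIFIED

    def on_snoop_read(state):
        # Another cache reads the line: a Modified copy is written back,
        # and Modified/Exclusive holders downgrade to Shared.
        if state in (MESI.MODIFIED, MESI.EXCLUSIVE):
            return MESI.SHARED
        return state

    def on_snoop_invalidate(state):
        # Another cache wants to write the line: drop our copy.
        return MESI.INVALID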

NoC Architecture Design: Second, ChatGPT recommends focusing on NoC architecture design. Decide on the NoC topology (e.g., mesh, torus, tree, or custom designs) based on the expected traffic pattern and the scalability requirements. Each topology has its specific advantages, as my colleague Andy Nightingale recently explained here. Furthermore, teams must design efficient routers to handle the chosen cache coherency protocol with minimal latency, implementing features like virtual channels to avoid deadlock and increase throughput. The final part of this step involves optimizing the network for bandwidth and latency by tuning the buffer sizes, employing efficient routing algorithms, and optimizing link widths and speeds.
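
As a small illustration of how the topology choice drives latency, this sketch computes the average hop count of an n x n mesh under XY (dimension-ordered) routing – one of the inputs an architect would weigh when comparing topologies:

    from itertools import product

    def average_mesh_hops(n):
        # Average hop count between distinct nodes in an n x n mesh
        # under XY routing, where hops = Manhattan distance.
        nodes = list(product(range(n), range(n)))
        total = pairs = 0
        for (x1, y1), (x2, y2) in product(nodes, repeat=2):
            if (x1, y1) != (x2, y2):
                total += abs(x1 - x2) + abs(y1 - y2)
                pairs += 1
        return total / pairs

    for n in (2, 4, 8):
        print(n, round(average_mesh_hops(n), 2))  # grows roughly as 2n/3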

Cache Coherency Mechanism Integration: Next up, ChatGPT recommends integrating the actual mechanisms of cache coherency. Integrating the cache coherency mechanism with the NoC involves efficient propagation of coherency messages (e.g., invalidate, update) across the network with minimal latency. Designing an efficient directory structure for directory-based protocols – one that can scale with the network and minimize coherency traffic – requires careful consideration of the placement of directories and the granularity of coherence (e.g., block-level vs. cache-line level).

By the way, for my query, it leaves out the option to handle coherency fully in software.
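
For illustration, here is a minimal Python sketch of a directory entry at cache-line granularity with a full sharer bit-vector (real directories use compressed or hierarchical formats precisely because this naive layout scales poorly):

    class DirectoryEntry:
        # Directory state for one cache line: a full bit-vector of
        # sharers plus a dirty flag. Storage grows linearly with the
        # number of caches, which is why large systems compress it.
        def __init__(self, num_caches):
            self.sharers = [False] * num_caches
            self.dirty = False

        def on_read(self, requester):
            # If the line is dirty, the owner supplies the data and
            # writes it back; the line then becomes clean and shared.
            self.dirty = False
            self.sharers[requester] = True

        def on_write(self, requester):
            # Collect every other sharer so invalidations can be sent,
            # then record the requester as the sole (dirty) owner.
            to_invalidate = [i for i, has in enumerate(self.sharers)
                             if has and i != requester]
            self.sharers = [False] * len(self.sharers)
            self.sharers[requester] = True
            self.dirty = True
            return to_invalidate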

Simulation and Analysis: At this point, ChatGPT correctly recommends using simulation tools to model your NoC design and evaluate its performance under various workloads. Tools like Gem5, NS-3, or custom simulators can be helpful. I would add SystemC models to the arsenal of tools available to design teams working on this from scratch. Teams need to analyze key performance metrics such as latency, throughput, and energy consumption and pay special attention to the performance of the cache coherency mechanisms.

The last bit is indeed critical: for coherent interconnects, the cost of a cache miss is drastically different from that of a cache hit.

Optimization and Scaling: This recommendation includes implementing adaptive routing algorithms and dynamic power management techniques to optimize performance and energy efficiency under varying workloads and ensuring the design can scale by adding more cores. This might involve modular design principles or hierarchical NoC structures.

Correct. But, in all practicality, at this point in the project, a lot of time has passed without writing a single line of RTL. Management will ask, “What’s up here?” So, some RTL coding has already happened at this point. Iterations happen fast. Engineers will be quick to blame marketing for iterative feature change requests like adding/removing interfaces, changing user bits, Quality of Service (QoS) requirements, address maps, safety needs, buffering, probes, interrupts, modules, etc. All of these can cause significant changes to the RTL. And often, at this point, teams have not yet begun to consider that the floorplan can cause further issues because of interface locations, blockages, and fences.
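
Returning to ChatGPT’s adaptive-routing recommendation: as a minimal sketch of the idea (illustrative only – a real router must also guarantee deadlock freedom, e.g., via virtual channels or turn restrictions), a switch can pick the less congested of the productive mesh directions:

    def adaptive_next_hop(cur, dst, congestion):
        # cur, dst: (x, y) mesh coordinates.
        # congestion: dict mapping an output port ('E', 'W', 'N', 'S')
        # to an occupancy estimate (e.g., buffer fill level).
        (cx, cy), (dx, dy) = cur, dst
        productive = []
        if dx > cx:
            productive.append('E')
        elif dx < cx:
            productive.append('W')
        if dy > cy:
            productive.append('N')
        elif dy < cy:
            productive.append('S')
        if not productive:
            return None  # already at the destination
        # Minimal adaptive choice: least congested productive port.
        return min(productive, key=lambda port: congestion.get(port, 0))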

Prototyping and Testing: Next, the recommendation is to use FPGA-based prototyping to validate your NoC design in hardware and to test the NoC in the context of the entire system, including the processor cores, memory hierarchy, and peripherals, to identify any issues with the cache coherency mechanism or other components.

True. Emulation and FPGA-based prototyping have become standard steps for most complex designs today. And cache coherency in particular, in the context of the overall system and its software, requires very long test sequences.

Iterative Design and Feedback: The last recommendation is to use feedback from the simulation, prototyping, and testing phases to refine the NoC design iteratively and benchmark the final design using standard benchmark suites relevant to your target application domain to ensure that it meets the desired performance and efficiency goals.

The cost of “make”

Short of hiring a team of architects with relevant NoC development experience, the first five steps – Understanding Protocols, NoC Architecture Design, Cache Coherency Mechanism Integration, Simulation & Analysis, and Optimization & Scaling – will take significant learning time, and writing the implementation specs is far from trivial.

Then, teams will spend most of the effort on RTL development and verification. Just imagine writing RTL protocol adapters for AMBA CHI-E, CHI-B, ACE, ACE-LITE, and AXI – connecting tens of IP blocks coherently – to address coherent and IO coherent use models. Even if you can reuse VIP from EDA vendors to check the protocol correctness, the effort is significant just for unit verification, as you will run thousands of tests.

For the actual interconnect, whether you use a heterogeneous, ring, or mesh topology, the development effort is significant. The logic that deals with directories to enable cache coherency can be complicated. And any change requests require, of course, re-coding!

Finally, when integrating everything in the system context, validating integration issues – including bring-up in emulation and the associated debug – consumes another significant chunk of effort.

Our customers tell us that when it is all said and done, they would easily spend over 50 person-years just on coherent NoC development for complex designs.

Fig.: Network-on-chip automation for productivity and configurability.

Automation potential: What to expect from coherent NoC IP

There is a lot of automation potential in the seven steps above!

  • The various relevant protocols can be captured in a library of protocol converters, reducing the need to internalize and implement all the protocols the reused IP blocks speak. Ideally, these converters would already be pre-validated with popular IP blocks from leading IP vendors – think providers of Arm and RISC-V ISAs and vendors of interface blocks like LPDDR, PCIe, UCIe, etc., or graphics and AI/ML accelerators.
  • Graphical user interfaces and scripting APIs increase productivity in developing NoC architectures and topologies.
  • Like protocol converters, reusable blocks for directory and cache coherency management can increase development productivity. Their verification is especially critical, so ideally, your vendor has pre-verified them using VIP from EDA vendors and pre-validated the system integration with the ecosystem (think processor providers).
  • The refinement loop is probably the most critical one to optimize. Refinement can be deadly in manual scenarios. Besides reusable building blocks, you should look for configuration tools to automatically create performance models for architectural analysis, export new RTL configurations, and directly connect to digital implementation flows.

The verdict: Make or buy?

The illustration above summarizes some of the automation potential for NoCs. Compared to developing NoC IP from scratch, saving more than 50 person-years makes licensing an attractive option. Check out what Arteris does in this domain with its Ncore Cache Coherent Interconnect IP and FlexNoC 5 Interconnect IP for non-coherent designs.

