Nvidia is known as the GPU maker of choice for AI workloads. Its Hopper and Blackwell GPUs power virtually every major cloud provider and large-scale AI training farm around the world. Yet for all the attention its GPUs receive, the Nvidia Grace CPU flies under the radar as a legitimate compute platform, even though it is capable of supporting the most demanding enterprise workloads.
Grace is one of many CPUs built on the Arm Neoverse platform to power the workloads that enterprise IT organizations rely on every day. Every major cloud service provider has partnered with Arm to design and produce silicon, whether CPUs or accelerators, across its datacenters. And they are doing this for good reason: Neoverse delivers value that can't be realized from other current suppliers.
This research note explores how Arm has become a critical silicon player in the datacenter and a strategic partner to top silicon and hardware makers.
Digging into the Grace Details
Launched in 2023, the Grace CPU is built on Arm Neoverse V2, the second generation of Arm’s high-performance infrastructure platform. With Neoverse V2, Arm significantly improved performance, energy efficiency, scalability, and security over the prior generation.
Grace is fabricated on TSMC's 4N process, a 4nm-class node customized for Nvidia. It packs 72 Neoverse V2 cores connected by Nvidia's Scalable Coherency Fabric (SCF), which delivers up to 3.2 TB/s of bisection bandwidth to keep cores fed and data moving. The SCF also connects the compute complex with I/O, memory, NVLink, and cache.
For I/O, Grace ships with up to 128 lanes of PCIe Gen5. But thanks to the Arm Neoverse platform's flexibility (more on that shortly), Grace also uses NVLink-C2C, an interconnect roughly seven times faster than PCIe Gen5, for CPU-to-CPU and CPU-to-GPU connectivity. NVLink-C2C also makes CPU and GPU memory coherent, which improves performance for accelerated computing uses such as AI and HPC.
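To make the coherency point concrete, here is a minimal CUDA sketch of what that memory model enables. This is an illustrative example of my own, not Nvidia sample code, and it assumes a hardware-coherent platform such as Grace Hopper, where a GPU kernel can directly touch memory allocated with plain malloc():

```cuda
// Illustrative sketch: on a coherent CPU-GPU platform (e.g., Grace
// Hopper), the GPU can operate directly on memory from plain malloc(),
// with NVLink-C2C hardware coherency keeping both views consistent.
// On a PCIe-attached GPU, this pattern would typically require
// cudaMallocManaged() or explicit cudaMemcpy() staging instead.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;  // GPU writes CPU-allocated memory
}

int main() {
    const int n = 1 << 20;
    // Ordinary host allocation: no cudaMalloc, no explicit transfers.
    float *data = static_cast<float *>(malloc(n * sizeof(float)));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();  // kernel results now visible to the CPU

    printf("data[0] = %.1f\n", data[0]);  // expect 2.0 on coherent hardware
    free(data);
    return 0;
}
```

The practical upshot is that coherency removes the copy-and-synchronize choreography that PCIe-attached accelerators usually impose on developers.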
Nvidia took a unique approach to Grace's memory. Rather than deploy traditional server DIMMs, the company chose low-power LPDDR5X, the same class of memory one would find in a laptop, to shave power consumption a watt at a time, and made it server/enterprise-ready by adding RAS (reliability, availability, serviceability) capabilities. In fact, Nvidia claims roughly 500 GB/s of memory bandwidth from just 15 watts, on the order of 33 GB/s per watt.
The standalone, CPU-only product, known as the Grace Superchip, is especially significant. (It pairs two Grace dies over NVLink-C2C for 144 cores in a single module.) To me, it represents how the Arm ecosystem in general, and Nvidia in particular, can deliver platforms to support workloads everywhere, whether in the cloud or in on-prem datacenters. The Grace Superchip challenges the dominance of x86 in applications and workloads that were once considered walled off. More than that, it delivers what Nvidia claims is twice the performance per watt of comparable x86 platforms.
Arm Is a Compute Platform Enabling Unprecedented Flexibility
Nvidia selecting Arm as its partner and Neoverse as its CPU platform for Grace is a design choice rooted in flexibility. Neoverse enables companies like Nvidia to produce homegrown SoCs quickly and efficiently, by delivering a compute platform that overcomes the limitations of x86 and lets chip designers add their own innovations where they see fit. Need a wider instruction pipeline? Want to use low-power memory for greater power efficiency? Or add a high-speed interconnect with greater throughput than anything commercially available? Arm Neoverse enables all of that; the x86 suppliers do not.
While Nvidia's use of Neoverse may have the highest profile because of the attention drawn by AI workloads and by Nvidia itself, Neoverse is also used more broadly among the cloud service providers (CSPs): AWS (Graviton), Azure (Cobalt), and Google Cloud (Axion) have all taken Neoverse and built custom chips optimized for their specific performance and power targets.
For CSPs, performance and power optimization are everything. Getting this balance just right is critical to lowering costs while satisfying SLAs. Only so much customization can occur when deploying x86 chips to support these environments, which prevents CSPs from achieving optimal efficiency. The reason for this is simple: x86 suppliers have designed a chip to support a broad range of workloads, customers, and deployment scenarios. In attempting to meet the needs of all customers broadly with a single go-to-market approach, they are prevented from precisely meeting the needs of individual customers.
By contrast, when CSPs look at Neoverse, they see a platform offering the level of deep customization they want. Further, the CSPs can license and design custom silicon while still realizing cost savings over incumbent x86 alternatives. This translates to substantial medium- to long-term savings for these cloud providers with Arm as their compute platform.
While CSPs have used Neoverse to deliver consistent performance and cost savings, the Arm server CPU market also has legs on the commercial side. The aforementioned Grace Superchip is explicitly designed to support enterprise workloads wherever they run, in the cloud or on-prem. Likewise, Ampere has built a healthy business shipping the industry's first commercial Arm-based server CPU. That chip runs in the largest clouds, and Ampere has partnered with Qualcomm to deliver a powerful yet power-efficient AI inference platform as well.
It’s Time to Have the Arm Conversation
Not only are Arm CPUs populating cloud datacenters, but the workloads being supported by these processors have grown from cloud-native and cloud-specific to virtually anything. This is partly because of the Neoverse-based CPUs’ performance, and partly because the software ecosystem embraces Arm. The walled garden that the x86 architecture enjoyed for decades continues to shrink as Arm gains in popularity and ecosystem support—both in hardware and software.
Tomorrow’s datacenter will comprise different compute engines driving highly bespoke workloads with unique performance requirements. This dynamic has been evident in both general-purpose and high-performance computing for some time. The traditional model of designing silicon for broad use and having it run in a less-than-optimal fashion is, well, out of fashion. Application architects and IT organizations want compute platforms that are finely tuned for performance yet universal underneath for the sake of management and interoperability. And that is Arm in a nutshell.
Arm is already broadly adopted in the cloud, and it's coming to the enterprise datacenter, no question. It will be the silicon platform for many essential modern workloads that drive the business—workloads that benefit from a specific kind of acceleration or customization that simply can't be achieved with x86.
It’s just a matter of time.