Nov 8, 2017
Qualcomm products mentioned within this post are offered by Qualcomm Technologies, Inc. and/or its subsidiaries.
Today marks a major milestone in the processor industry: we’ve launched Qualcomm Centriq 2400, the world’s first and only 10nm server processor. While this is the culmination of an intensive five-year journey for the Qualcomm Datacenter Technologies (QDT) team, it also marks the beginning of an era that will see a step-function improvement in the economics and energy efficiency of operating a datacenter.
Cloud is reshaping datacenter computing
Cloud computing is growing at a torrid pace. Driven by the virtuous cycle of scale driving efficiencies, which in turn drive larger scale, cloud is expected to account for more than 50% of datacenter server revenue by 2020 [1]. This growth is driving profound shifts in datacenter infrastructure. Cloud service providers (CSPs) have evolved their software stacks to take full advantage of modern, high-core-count processors, shifting from monolithic code that is deployed and scaled up as one giant application to multi-threaded applications built for scale-out, including distributed databases, distributed file systems, and tiered application topologies. Microservices, independent bite-sized components that can be deployed through containers, are accelerating the momentum behind scale-out infrastructure. The Qualcomm Centriq 2400 is the next step in optimized performance for this new class of cloud datacenter infrastructure.
Purpose built for cloud
Three characteristics of cloud software stand out: it is highly threaded, throughput oriented, and distributed across scale-out deployments. Qualcomm Centriq 2400 is specifically designed for maximum efficiency running cloud software.
There are five dominant characteristics that a processor optimized for the cloud needs to meet:
- High aggregate throughput performance with high per-thread performance under load
- A large number of hardware threads that multi-threaded software can fully utilize
- Quality of service (QoS) features to ensure resources are allocated fairly (i.e., avoiding the ‘noisy neighbor’ issue)
- High energy efficiency to maximize compute density and reduce operating costs
- Low acquisition costs
From the very beginning, we’ve taken these as the fundamental tenets for the design of the Qualcomm Centriq 2400 processor. From concept to architecture to design and development, we translated these tenets into a cutting-edge processor, and today we disclosed that the Qualcomm Centriq 2400 delivers exceptional throughput performance, performance per watt and performance per dollar.
Throughput and per-thread performance
The Qualcomm Centriq 2400 processor, based on the Qualcomm Falkor CPU, QDT’s own Armv8-based custom CPU core design, delivers leading-edge aggregate performance, as shown by SPECint_rate2006 [2] score estimates. These scores are based on the open source gcc compiler, using -O2 flags, consistent with how cloud developers compile their own code [3].
Many cloud applications require real-time responsiveness, necessitating single-thread performance while the machine is running multiple threads at high utilization. For this, the single-thread SPECint_2006 benchmark is not the relevant choice, as it measures performance when the machine is at its minimum loading. Instead, we looked at the aggregate performance of the machine using SPECint_rate2006, divided by the number of active hardware threads. This reflects the performance of any individual thread when the server is operating at its design point of maximum multi-threaded performance. By that metric, the Qualcomm Centriq 2400 has not only reached high aggregate performance, but it has done so without compromising per-thread performance.
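As a rough illustration of the per-thread metric described above, the sketch below divides an aggregate rate score by the number of active hardware threads. The function name and the example score are hypothetical placeholders chosen for illustration, not measured results, and the example assumes one hardware thread per core.

```python
def per_thread_throughput(rate_score: float, active_threads: int) -> float:
    """Return the aggregate rate score divided by the active hardware threads."""
    if active_threads <= 0:
        raise ValueError("active_threads must be positive")
    return rate_score / active_threads


# Hypothetical input: a made-up aggregate score of 480.0 on a 48-core part,
# assuming one hardware thread per core. Not a published result.
example_per_thread = per_thread_throughput(480.0, 48)
print(f"per-thread throughput under full load: {example_per_thread}")
```

The point of the metric is that it is taken while every thread is busy, so it captures contention for shared resources that an unloaded single-thread run would not see.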
Many CSPs require predictable performance to meet their customer demands and SLAs. The specified peak frequency for the Qualcomm Centriq 2400 family is independent of the number of cores that are active. This means that CSPs can minimize performance variability as more cores are switched on to handle increased load.
The Qualcomm Centriq 2400 delivers better performance per watt than competing x86 server processors [4]. We’ve taken a typical Qualcomm Centriq 2460 processor and run SPECint_rate2006, measuring the average power for each sub-test. All tests ran at the full 2.6 GHz peak frequency. As a first-order view, the average (both mean and median) power of those measurements was 65W. Running the same test on an Intel Xeon Platinum 8176, which has similar SPECint_rate2006 performance when compiled with gcc -O2, the power we measured was significantly higher: it ran at 100% of its 165W thermal design power (TDP), drawing over 2.5x as much power for similar performance!
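The 2.5x figure follows directly from the two power numbers quoted above. A minimal sketch of the arithmetic, where the constant and function names are illustrative and "similar performance" is simplified to exactly equal performance:

```python
CENTRIQ_2460_AVG_POWER_W = 65.0   # average measured power quoted above
XEON_8176_POWER_W = 165.0         # 100% of the 165W TDP quoted above


def perf_per_watt_advantage(perf_a: float, power_a_w: float,
                            perf_b: float, power_b_w: float) -> float:
    """Ratio of A's performance per watt to B's performance per watt."""
    return (perf_a / power_a_w) / (perf_b / power_b_w)


# Simplifying assumption: both parts deliver the same normalized score (1.0),
# so the advantage reduces to the ratio of the two power figures.
advantage = perf_per_watt_advantage(1.0, CENTRIQ_2460_AVG_POWER_W,
                                    1.0, XEON_8176_POWER_W)
print(f"{advantage:.2f}x")  # prints 2.54x
```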
Another important metric is the processor TDP, as servers will be designed based on the specified TDP. Stepping back from the highest bin parts, we can compare the Qualcomm Centriq 2452 processor with the Intel Xeon Gold 6152. Using SPECint_rate2006 performance divided by TDP, the Qualcomm Centriq 2452 has 33% better performance per watt. Looking at the inverse, with racks typically limited in power capacity, that translates to a significant increase in the amount of compute capacity that can be packed into a rack. (Actual increase depends on server overhead power, server utilization, and rack capacity, among other things.)
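To make the rack-density argument concrete, here is a small sketch of how a fixed rack power budget translates into socket count. All of the numbers below (rack budget, processor TDPs, per-socket overhead) are hypothetical round values picked for illustration; they are not the specifications of any processor named in this post.

```python
import math


def sockets_per_rack(rack_power_w: float, processor_tdp_w: float,
                     overhead_per_socket_w: float) -> int:
    """Sockets that fit in a rack power budget, provisioning each at full TDP."""
    per_socket_w = processor_tdp_w + overhead_per_socket_w
    return math.floor(rack_power_w / per_socket_w)


# Hypothetical inputs: a 10 kW rack and 100 W of non-CPU power per socket
# (memory, storage, fans, power conversion losses).
RACK_POWER_W = 10_000.0
OVERHEAD_W = 100.0

lower_tdp_sockets = sockets_per_rack(RACK_POWER_W, 100.0, OVERHEAD_W)   # 50
higher_tdp_sockets = sockets_per_rack(RACK_POWER_W, 150.0, OVERHEAD_W)  # 40
print(lower_tdp_sockets, higher_tdp_sockets)
```

Note how the per-socket overhead dilutes the benefit: in this made-up case a 33% lower TDP yields 25% more sockets, which is why the actual increase depends on server overhead power.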
Idle power is also an important metric for many datacenter customers, as unnecessary power draw during idle periods can result in significant energy consumption costs over the period of an infrastructure’s useful life. The Qualcomm Centriq 2400 family delivers extremely low idle power. We’ve measured power during OS idle at 8W even when the deepest idle state is limited to C1 in order to minimize idle exit latency. With deeper idle states enabled, measured power plummets to below 4W, using Qualcomm Centriq 2400’s fast power collapse with hardware save/restore logic. In environments where server utilization is low, this combination of low power during both active and idle states translates to significant energy savings and a much greener datacenter.
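For a sense of scale, the idle figures above can be converted into annual energy per processor. This is a back-of-the-envelope sketch: the function name and the `idle_fraction` parameter are illustrative, and 4W is used as an upper bound for the "below 4W" measurement.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annual_idle_energy_kwh(idle_power_w: float, idle_fraction: float = 1.0) -> float:
    """Energy in kWh drawn at idle over one year, for a given share of time idle."""
    return idle_power_w * HOURS_PER_YEAR * idle_fraction / 1000.0


# Idle power figures quoted above, assuming the processor idles all year.
c1_limited_kwh = annual_idle_energy_kwh(8.0)   # about 70.08 kWh
deep_idle_kwh = annual_idle_energy_kwh(4.0)    # about 35.04 kWh (upper bound)
print(f"{c1_limited_kwh:.2f} kWh vs {deep_idle_kwh:.2f} kWh per processor per year")
```

Multiplying by the number of sockets in a fleet, and by a realistic idle fraction, gives the fleet-level figure.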
Total cost of ownership
The biggest factor in the TCO of running a datacenter, however, is the acquisition cost of the servers, and the processor is one of the most expensive components in the server. The Qualcomm Centriq 2400 processor delivers phenomenal performance per dollar. With a list price [5] of $1,995, the 48-core Qualcomm Centriq 2460 processor delivers 4X better performance per dollar versus Intel’s highest-performance Skylake processor, the Intel Xeon Platinum 8180. With a list price of $1,373, the 46-core Qualcomm Centriq 2452 processor offers 3X better performance per dollar versus Intel Xeon Gold 6152. And, with a list price of $888, the 40-core Qualcomm Centriq 2434 processor offers 2X better performance per dollar versus Intel Xeon Silver 4116 [6].
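The performance-per-dollar comparisons above follow a simple formula: benchmark score divided by list price, then the ratio of the two results. The sketch below shows that calculation. Only the $1,995 list price comes from this post; the scores and the competitor price are hypothetical placeholders, so the printed ratio is illustrative and does not reproduce the 4X, 3X, or 2X figures.

```python
def perf_per_dollar(rate_score: float, list_price_usd: float) -> float:
    """Benchmark score obtained per dollar of list price."""
    return rate_score / list_price_usd


def relative_perf_per_dollar(score_a: float, price_a_usd: float,
                             score_b: float, price_b_usd: float) -> float:
    """How many times better A's performance per dollar is than B's."""
    return perf_per_dollar(score_a, price_a_usd) / perf_per_dollar(score_b, price_b_usd)


CENTRIQ_2460_LIST_PRICE_USD = 1995.0  # list price quoted above

# Hypothetical scores and competitor price, for illustration only.
ratio = relative_perf_per_dollar(score_a=500.0,
                                 price_a_usd=CENTRIQ_2460_LIST_PRICE_USD,
                                 score_b=550.0,
                                 price_b_usd=8000.0)
print(f"{ratio:.2f}x better performance per dollar (hypothetical inputs)")
```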
Qualcomm Centriq 2400 delivers many other key benefits for the cloud, such as quality of service management, in-line memory bandwidth compression, and secure root of trust at the silicon level, which we detailed here and here.
Driving an open ecosystem
Driving an open ecosystem around the Qualcomm Centriq 2400 processor is a critical pillar of our strategy. To us, open ecosystem means embracing open standards and collaboration with hardware, software, and system vendors. Through these collaborations, we’re delivering best-of-breed solutions for our customers to deploy on Qualcomm Centriq 2400 processors.
Over the past few years, the Arm-based processor ecosystem has made tremendous progress in enabling server software for the cloud. Most open source software is already available on Arm-based server processors. Foundational software such as firmware, operating systems, compilers, virtualization and containers is supported on Arm processors, and infrastructure software such as language runtimes, databases (NoSQL and SQL), web front end, data analytics, and orchestration is also supported on Arm processors.
Key cloud workload targets
With leading-edge performance, innovative features, and an open ecosystem, the Qualcomm Centriq 2400 family is optimized for cloud-native workloads. Workloads that are a good fit for Qualcomm Centriq 2400 processors include web front end, NoSQL databases, big data analytics, content delivery networks, video and image processing applications, image recognition, health- and life-sciences applications, and software-defined NVMe storage farms. At our launch event today, we’re demonstrating many of these cloud workloads running on Qualcomm Centriq 2400 processor-based servers.
In optimizing for cloud workloads, there is understandably a set of workloads that we are not currently targeting. Some traditional enterprise IT workloads that don’t scale with cores fall into this category. A good example here would be transactional databases that use scale-up servers to be able to handle large databases.
We’re excited about bringing to market the world’s first and only 10nm server processor. Qualcomm Centriq 2400 delivers exceptional throughput performance, leadership performance-per-watt and performance-per-dollar, and drastically shifts the economics of ownership and operation for cloud datacenter operators. We’re looking forward to continuing to work with our customers and partners to drive further innovations into datacenter infrastructure.
1. According to an IDC report from December 2016.
2. These are called estimated scores, as they have not yet gone through the SPEC.org reporting methods.
3, 4, 6. Details on performance measurements are in the end notes section of the presentation posted here.
5. List prices as of 11/8/2017.