Qualcomm Cloud AI 100 continues to lead in power efficiency with the latest MLPerf v3.0 results
MLCommons™ continues its effort to create machine learning benchmarks for training and inference platforms with MLPerf™ v3.0. The open engineering consortium has received numerous submissions to its benchmark suite, which measures how fast systems can train models to a target quality metric, and once again Qualcomm Technologies, Inc. has demonstrated its leadership in this field with the Qualcomm Cloud AI 100.
The MLPerf™ v3.0 inference results for the Qualcomm Cloud AI 100 demonstrate the commitment of Qualcomm Technologies to the MLCommons™ MLPerf™ Benchmarks by extending its submission scope to all benchmark divisions (closed, open, and network), with power submissions in each. The Qualcomm Cloud AI 100 continues to lead power efficiency benchmarks, not only against the current generation of its nearest competition but also, in many benchmarks, against competing next-generation AI accelerators. All benchmarks show an increase in performance and power efficiency for natural language processing and computer-vision networks.
For the first time, Qualcomm Technologies introduced the Qualcomm Cloud AI 100 PCIe Lite accelerator, which is configurable from 35-55W Thermal Design Power (TDP). It addresses AI Edge market requirements for low power and high performance while driving higher performance/TCO$. Qualcomm Cloud AI collaborators have added several new server platforms for the edge and datacenter categories. The newly introduced platforms include the HPE ProLiant DL385 server, Dell PowerEdge XR4520c, Dell PowerEdge R650, and Lenovo ThinkEdge SE450. Krai.ai, a Qualcomm AI software collaborator, has made several submissions on Qualcomm Cloud AI 100 based platforms, including the Qualcomm Robotics RB6 Platform.
Qualcomm Technologies’ MLPerf™ benchmark results continue to expand with every submission. MLPerf™ v3.0 includes 25 systems and more than 320 results based on the Qualcomm Cloud AI 100 accelerators.
Qualcomm Technologies’ MLPerf™ v3.0 inference benchmark results have surpassed its own previous records for peak offline performance, power efficiency, and latency in all categories. The 2U datacenter server platform with 18x Qualcomm Cloud AI 100 Pro (75W TDP) accelerators achieved a ResNet-50 offline peak performance of 430K+ inferences per second and a power efficiency of 241 inferences/second/watt. Across all platforms, Qualcomm Cloud AI 100 RetinaNet offline performance has improved by ~40-80% and server performance by up to 110%.
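To make the inferences/second/watt metric concrete, the short Python sketch below treats power efficiency as throughput divided by average system power and works backward from the two ResNet-50 figures quoted above. It is only an illustration: the function names are hypothetical, "430K+" is rounded down to 430,000, and it assumes the peak-throughput and efficiency figures come from comparable runs, which the results above do not state.

```python
def efficiency_ips_per_watt(throughput_ips: float, avg_power_w: float) -> float:
    """Power efficiency expressed as inferences per second per watt."""
    if avg_power_w <= 0:
        raise ValueError("average power must be positive")
    return throughput_ips / avg_power_w


def implied_power_w(throughput_ips: float, efficiency: float) -> float:
    """Average power implied by a throughput and an efficiency figure."""
    if efficiency <= 0:
        raise ValueError("efficiency must be positive")
    return throughput_ips / efficiency


# Figures quoted in the text; "430K+" is treated as a lower bound of 430,000.
resnet50_offline_ips = 430_000
resnet50_efficiency = 241  # inferences/second/watt
num_accelerators = 18

# Assumption: both figures describe the same 18-accelerator system.
system_power_w = implied_power_w(resnet50_offline_ips, resnet50_efficiency)
per_card_ips = resnet50_offline_ips / num_accelerators

print(f"implied average system power: ~{system_power_w:.0f} W")
print(f"throughput per accelerator:   ~{per_card_ips:.0f} inferences/s")
```

Under those assumptions the numbers correspond to roughly 1.8 kW of average system power and about 23.9K inferences per second per accelerator. Treat these as back-of-the-envelope estimates, not reported measurements.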
Since its first MLPerf™ v1.0 submission, Qualcomm Cloud AI 100 performance has improved by up to 86%, while power efficiency has improved by up to 52%. These gains come from software optimizations, such as improvements in the AI compiler, DCVS algorithms, and memory usage. They reflect continuous effort and investment over the last few years.
Qualcomm Technologies continues to innovate and optimize AI solutions across all submissions with a focus on power efficiency and performance. This year, with our PCIe Lite card, we are showing our ability to meet the needs of AI edge applications. The low-power, high-performance AI inference solution can be configured from 35-55W TDP depending on the application requirements. Like the PCIe Pro and PCIe Standard SKUs, it supports a wide range of neural network models and frameworks, such as TensorFlow, PyTorch, and ONNX Runtime. It also leverages the Qualcomm Neural Processing SDK for AI, which provides a unified API for easy integration and deployment. Here are some of the key benefits of the PCIe Lite card:
- High throughput and low latency: The accelerator can deliver close to 200 TOPS of peak performance at 55W TDP, enabling fast and responsive inference for complex models such as BERT and RetinaNet.
- Scalability and flexibility: The accelerator can be used in various form factors and configurations, such as single or multiple cards per server, or standalone devices with PCIe or USB interfaces. It can also support heterogeneous computing with other processors such as CPUs or GPUs.
- Power efficiency and reliability: The accelerator uses advanced power management techniques to optimize power consumption and thermal performance.
As AI and ML workloads grow in size and complexity, they demand more computing resources and consume more energy. This poses a challenge for both providers and end users who want to deliver and access high-quality services at a reasonable cost. It is therefore essential to optimize the performance and efficiency of the solutions that support these workloads. Qualcomm Technologies offers a wide range of power-efficient AI accelerators to meet performance/TCO$ requirements.
Qualcomm Technologies is committed to innovating and optimizing AI solutions across all submissions with a focus on power efficiency. The Qualcomm Cloud AI 100 has yet again delivered record-breaking performance, with its highest-ever power efficiency in inferences per second per watt (I/S/W): 315 for ResNet-50 and 5.9 for RetinaNet, surpassing the nearest competition by more than 2x. In the Datacenter category, the Qualcomm Cloud AI 100 PCIe Pro accelerator continues to excel in power efficiency, not only within its own generation of AI accelerators but also against the nearest competition's next-generation accelerators, which it leads by 50% on the ResNet-50 and RetinaNet networks.
For the Closed Edge division, the HPE submission with the Edgeline Converged Edge system server e920d and 4x Qualcomm Cloud AI 100 Standard AI accelerators recorded the lowest single-stream latency of 0.34 ms and multi-stream (8) latency of 0.58 ms for ResNet-50 networks.
We made network division submissions for the datacenter category for the first time and achieved nearly the same results as in the Closed division. Very high server-to-offline performance ratios in QPS (Queries Per Second), together with matching network division performance, demonstrate the ability of Qualcomm Cloud AI 100 inference accelerators to deliver the same level of performance across a wide variety of deployment conditions.
In the Open division, Qualcomm Technologies submitted 3 BERT networks (Block Pruned 83.2, Block Pruned 82.6, and DistilBERT) using the same SQuADv1.1 dataset as the MLPerf™ BERT models. The BERT pruned models delivered record-breaking power efficiency of up to 22 I/S/W and nearly 3x the performance, while meeting the MLPerf™ accuracy requirement of 99% or more of the FP32 F1 score. Details of the BERT pruned models are as follows:
- BERT Block Pruned 83.2: 131M parameters: 32% of non-embedding parameters in original BERT-large
- BERT Block Pruned 82.6: 106M parameters: 24% of non-embedding parameters in original BERT-large
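The accuracy requirement mentioned above (99% or more of the FP32 F1 score) amounts to a simple threshold check, sketched below in Python. The function names are hypothetical and the F1 values are placeholders chosen only to exercise the check. They are not measured results for any of the submitted models.

```python
def accuracy_target(fp32_f1: float, fraction: float = 0.99) -> float:
    """Minimum F1 a model must reach: a fixed fraction of the FP32 F1."""
    return fraction * fp32_f1


def meets_accuracy_target(model_f1: float, fp32_f1: float,
                          fraction: float = 0.99) -> bool:
    """True when the model's F1 is at least `fraction` of the FP32 F1."""
    return model_f1 >= accuracy_target(fp32_f1, fraction)


# Placeholder FP32 reference score (an assumption, not from the results).
fp32_f1 = 90.0

print(meets_accuracy_target(89.5, fp32_f1))  # above the 99% threshold
print(meets_accuracy_target(89.0, fp32_f1))  # below the 99% threshold
```

With a placeholder reference of 90.0 the threshold sits near 89.1, so a pruned or quantized model can lose under one F1 point before it falls out of compliance.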
Qualcomm Technologies continues to prove its leadership in AI solutions with the MLPerf™ v3.0 submissions. We have highlighted the breadth of our inference applications for both the Edge and Datacenter categories and have outperformed competitors in key metrics such as inferences per second and I/S/W. The Qualcomm Cloud AI 100 is a unique solution that offers high computational performance, low latency, and low power utilization. This year we have demonstrated that performance not only with the DM.2, Standard, and Pro accelerator cards but also with our Qualcomm Cloud AI 100 Lite accelerator card.

