Unlocking the power of Qualcomm QNN Execution Provider GPU backend for ONNX Runtime
In the ever-evolving landscape of Generative AI and Deep Learning, diverse workloads necessitate the use of NPU, GPU, and CPU to achieve the best experience on edge platforms. At Qualcomm Technologies, we continuously enhance our AI software stack to enable optimal performance on devices with Snapdragon X Elite.
Overview
Currently, ONNX Runtime supports either the Qualcomm® AI Engine Direct (QNN) Execution Provider (EP) with the NPU (Qualcomm HTP) backend or the DirectML GPU stack. Today we are pleased to announce the preview of the ONNX Runtime Qualcomm® AI Engine Direct (QNN) EP with the Qualcomm Adreno GPU backend.
Choosing the Backend
The backend for the Qualcomm ONNX Runtime (ORT) QNN EP is chosen when the inference session is created, through the Qualcomm ORT QNN EP options. Apart from this option, client code uses the HTP backend and the GPU backend in exactly the same way.
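import onnxruntime as ort

# Create an ONNX Runtime session on the QNN EP GPU backend.
options = ort.SessionOptions()
session = ort.InferenceSession(
    model_path,  # path to the .onnx model file
    sess_options=options,
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_path": "QnnGpu.dll"}],
)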
// C++: select the GPU backend through the QNN EP options.
std::unordered_map<std::string, std::string> qnn_options;
qnn_options["backend_path"] = "QnnGpu.dll";
Ort::SessionOptions session_options;
session_options.AppendExecutionProvider("QNN", qnn_options);
Ort::Session session(env, model_path.c_str(), session_options);
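Because the backend is just one entry in the EP options, switching between backends can be reduced to a small lookup. The following is a minimal Python sketch, not part of the QNN EP API: the helper names `qnn_session_kwargs` and `create_qnn_session` are hypothetical, and the HTP library name `QnnHtp.dll` is an assumption taken from the ONNX Runtime QNN EP documentation rather than from this article.

```python
from typing import Dict

# Backend library names. "QnnGpu.dll" is the value used in this article;
# "QnnHtp.dll" is an assumption based on the ONNX Runtime QNN EP documentation.
QNN_BACKEND_LIBRARIES: Dict[str, str] = {
    "gpu": "QnnGpu.dll",
    "htp": "QnnHtp.dll",
}


def qnn_session_kwargs(backend: str) -> Dict[str, list]:
    """Return the providers/provider_options arguments for one QNN backend."""
    try:
        library = QNN_BACKEND_LIBRARIES[backend.lower()]
    except KeyError:
        raise ValueError(f"unknown QNN backend: {backend!r}") from None
    return {
        "providers": ["QNNExecutionProvider"],
        "provider_options": [{"backend_path": library}],
    }


def create_qnn_session(model_path: str, backend: str = "gpu"):
    """Create an inference session on the chosen QNN backend (hypothetical helper)."""
    # Imported lazily so the lookup above can be used without ONNX Runtime;
    # creating a session requires a QNN-enabled ONNX Runtime build.
    import onnxruntime as ort

    return ort.InferenceSession(model_path, **qnn_session_kwargs(backend))
```

With this in place, `create_qnn_session(model_path, "gpu")` and `create_qnn_session(model_path, "htp")` would differ only in the library passed as `backend_path`; the rest of the client code stays the same.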
Running Inference fully on Adreno GPU
To check if your model graph can fully run on the Qualcomm ORT QNN EP’s GPU backend, you can disable CPU fallback and run the model with the Qualcomm ORT QNN EP. If your model runs successfully, it indicates that it ran fully on the GPU backend. Below are examples of how this can be done:
# (Optional) Enable configuration that raises an exception if
# the model can't be run entirely on the QNN backend.
options.add_session_config_entry("session.disable_cpu_ep_fallback", "1")
session_options.AddConfigEntry("session.disable_cpu_ep_fallback", "1");
OpenCL Driver Dependency
The GPU backend depends on the OpenCL driver installed on the system. Any driver at or after the preview version runs correctly with Qualcomm QNN SDK versions 2.33 and above.
For older versions of the driver, a workaround is available: enable the verbose logging level in WCR. Below are examples of how this can be done:
qnn_options["log_severity_level"] = "0";
# Set the logging level to Verbose for QnnGpu to work with older driver
ort.set_default_logger_severity(0)
Supported Models
The GPU backend is currently in its preview stages of enablement. The following models are validated on Qualcomm GPU:
- resnet_50_fp16
- resnet_50_fp32
- google_vit_base_fp32
- squeezenet1.0-7
- mobilenetv2-7
- emotion-ferplus-8
- arcfaceresnet100-8
- intel_bert_fp32
- google_bert_fp32
Other models may run partially on the GPU, with the rest falling back to the CPU.
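To find out which case a given model falls into, the CPU-fallback switch described earlier can be wrapped in a small check. The sketch below is illustrative only: `runs_fully_on_gpu` is a hypothetical helper, it assumes that ONNX Runtime reports the disabled fallback as an exception while the session is being created, and it treats any exception as "not fully on GPU", so a missing model file or missing driver would also return `False`.

```python
from typing import Any, Tuple


def runs_fully_on_gpu(model_path: str, ort_module: Any = None) -> Tuple[bool, str]:
    """Return (True, "") if the session can be created with CPU fallback disabled.

    ort_module exists only so the check can be exercised without a device;
    by default the real onnxruntime package is imported.
    """
    if ort_module is None:
        import onnxruntime as ort_module  # needs a QNN-enabled build

    options = ort_module.SessionOptions()
    # Make ONNX Runtime fail instead of silently assigning nodes to the CPU EP.
    options.add_session_config_entry("session.disable_cpu_ep_fallback", "1")

    try:
        ort_module.InferenceSession(
            model_path,
            sess_options=options,
            providers=["QNNExecutionProvider"],
            provider_options=[{"backend_path": "QnnGpu.dll"}],
        )
    except Exception as exc:  # broad on purpose: see the caveat above
        return False, str(exc)
    return True, ""
```

A stricter variant would also run one inference with representative inputs before reporting success, which matches the "run the model" wording of the check more closely.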
Exclusive Backend Choice
Please note that while execution providers are tried in priority order for subgraph execution, the backend of the Qualcomm ORT QNN EP is an exclusive choice: a session uses either one backend or the other.
If the HTP backend is chosen for the Qualcomm ORT QNN EP, then the GPU backend cannot be used for the Qualcomm ORT QNN EP at the same time for the same session.
Conclusion
We would like to receive feedback on the ONNX Runtime QNN GPU backend as we continue to enable more models, including LLMs.
Join a community of like-minded developers on the Qualcomm Developer Discord to benefit from real-time discussions and exclusive virtual events.
Additional resources:
Qualcomm AI Engine Direct Execution Provider documentation on ONNX Runtime website

