OnQ Blog

Shifting AI inference from the cloud to your phone can reduce AI costs

Every AI query has a cost, and not just in dollars. A study shows that distributing AI workloads to your devices, such as your smartphone, can reduce costs and decrease water consumption.

Written by

Vinesh Sukumar

Sep 16, 2025

What you should know:

A study shows that running AI inference on your phone can be more resource-efficient than relying solely on the cloud.
Hybrid AI architectures that combine on-device and cloud processing can optimize cost, performance and sustainability for everyday workloads.
Understanding the hidden financial and environmental costs of AI inference is essential for responsible and scalable technology adoption.

Imagine this: you’ve been tasked with writing a report, due tomorrow, on computing technology trends in 2050. You open your favorite GenAI assistant app, draft a prompt and hit send. Moments later, words pop up on your screen. All that for free, right?

In fact, it’s more complicated than that. Every AI query has a cost in dollars, power and water, and not just when you’re trying to be polite. Consumers and businesses see subscription fees rack up on their credit cards, but they may not see the environmental cost of relying on massive data centers for all their computing. And with approximately 163 million users expected in the US by 2029,¹ the impact on our wallets and the environment will continue to grow as GenAI apps become more ubiquitous in our lives.

But there is a way forward. Shifting a portion of AI inference to devices like your smartphone can be much more resource-efficient than relying solely on cloud servers.

This isn’t groundbreaking: many everyday GenAI use cases can already be handled on devices. For example, the latest OpenAI gpt-oss model can run on Snapdragon. Models are getting smaller while also becoming more capable and more efficient,² and application developers are looking for ways to cut cloud inferencing costs and respond to a growing demand for privacy and personalization, especially in the wake of agentic AI. In parallel, the performance of Neural Processing Units (NPUs) continues to increase, making on-device AI an increasingly viable and attractive option.

On-device AI can be more resource efficient than cloud inferencing

A recent study computed the hidden costs of common AI prompts in liters of water and in joules, the standard unit of energy.³

The researchers ran the same queries, using the same AI models, on both a Samsung Galaxy S24⁴ and Google Colab cloud servers.⁵ They found that on-device AI inference uses less energy and water, and reduces carbon dioxide emissions.⁶

Electricity and water are critical to data center operations: they power the racks and racks of GPUs, TPUs and other AI accelerators and, equally importantly, cool them down.

The study found that running AI inference on a Samsung Galaxy S24 can reduce inference energy consumption by up to 95% and carbon footprint by up to 88% compared with running the same workloads on Google Colab cloud servers. For water consumption,⁷ the average savings reach up to 96%.⁸
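
To make these percentages concrete, here is a minimal sketch of how a relative saving and a per-query water footprint could be computed. The function names and every numeric input are hypothetical placeholders, not values from the study. The water model assumes that the Scope 1 and Scope 2 components described in footnote 7 each scale linearly with the energy used, which is a simplification made for illustration.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh equals 3.6 million joules


def percent_reduction(cloud: float, device: float) -> float:
    """Relative saving of the on-device value versus the cloud value, in percent."""
    if cloud <= 0:
        raise ValueError("cloud baseline must be positive")
    return (cloud - device) / cloud * 100.0


def water_liters(energy_joules: float,
                 onsite_l_per_kwh: float,
                 offsite_l_per_kwh: float) -> float:
    """Water footprint under an assumed linear model.

    onsite_l_per_kwh:  direct cooling water per kWh (Scope 1); 0 for a phone.
    offsite_l_per_kwh: water used to generate each kWh of electricity (Scope 2).
    """
    kwh = energy_joules / JOULES_PER_KWH
    return kwh * (onsite_l_per_kwh + offsite_l_per_kwh)


# Hypothetical placeholder inputs, NOT measurements from the study.
cloud_energy_j = 1000.0
device_energy_j = 100.0

energy_saving = percent_reduction(cloud_energy_j, device_energy_j)

# The cloud pays for both scopes; the phone only for electricity generation.
cloud_water = water_liters(cloud_energy_j, onsite_l_per_kwh=1.0, offsite_l_per_kwh=3.0)
device_water = water_liters(device_energy_j, onsite_l_per_kwh=0.0, offsite_l_per_kwh=3.0)
water_saving = percent_reduction(cloud_water, device_water)

print(f"energy saving: {energy_saving:.1f}%")
print(f"water saving:  {water_saving:.1f}%")
```

In this toy setup the water saving comes out larger than the energy saving because the phone has no direct cooling water, which is one plausible reason the two figures can differ.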

The study's scope is limited: it covers a small number of experiments and uses non-optimized cloud inference, so further research is needed for more robust conclusions. Still, it suggests a promising path forward: moving from solely cloud-based AI inference to a hybrid approach that processes some routine workloads locally. This approach could relieve pressure on power grids and help reduce the environmental impact of large data centers.

The hidden costs of today’s data centers

Here is what happens under the hood. Currently, most AI inference is cloud-based: your prompt is sent to a model hosted on a server in a data center and, once it is processed, the output is sent back to your application. WSJ’s Joanna Stern offers a more visual walkthrough of your AI prompt’s journey and its energy consumption.

LLM providers bill application developers and end users for the infrastructure that processes each prompt, including AI acceleration hardware, storage and network bandwidth, as well as operational costs such as maintenance, technical support, and utility expenses like water and power. This makes it more expensive for developers to deploy GenAI applications, and makes those applications more expensive for end users. These ongoing cloud costs are part of why many GenAI products charge a monthly fee.

Study’s scope: The study compares the cost and environmental footprint of running generative AI models on the cloud versus edge devices. For the cloud, the models run on servers equipped with either Nvidia A100 or L4 GPUs, hosted on Google Colab. For edge devices, the models run on Samsung Galaxy S24 devices, powered by Snapdragon 8 Gen 3 processors.

Hybrid AI will unlock GenAI at scale

While on-device AI inference is ideal for some workloads, this isn’t a black-and-white situation. AI inference occurs on a spectrum, ranging from closest to the user on the device to further away in the cloud.

As we mentioned in this white paper, hybrid AI architectures can distribute AI inference between the cloud and edge devices depending on the complexity of the model and the query. If a model can run on a device for a given prompt without compromising accuracy, latency or generation length, inference should run on the edge device first. If the model is more complex, inference can be split between the device and the cloud: the device runs a ‘light’ version of the model while the cloud runs the ‘full’ model concurrently and corrects the device’s answers if needed.
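
The routing decision described above can be sketched roughly as follows. This is an illustrative toy, not the architecture from the white paper: the `complexity` score, the `device_limit` threshold, and the two stand-in model functions are all hypothetical, and "correction" is reduced to letting the full model's output replace the on-device draft when the two differ.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Tuple

Model = Callable[[str], str]


class Route(Enum):
    DEVICE = "device"        # answered entirely on the edge device
    HYBRID = "device+cloud"  # light model on device, full model in the cloud


@dataclass
class Query:
    prompt: str
    complexity: float  # hypothetical score in [0, 1]; higher means harder


def choose_route(query: Query, device_limit: float) -> Route:
    """Prefer the device whenever the query is within its assumed limit."""
    return Route.DEVICE if query.complexity <= device_limit else Route.HYBRID


def answer(query: Query, light_model: Model, full_model: Model,
           device_limit: float = 0.5) -> Tuple[Route, str]:
    route = choose_route(query, device_limit)
    if route is Route.DEVICE:
        return route, light_model(query.prompt)

    # Hybrid path: run both models concurrently, then let the full model's
    # output override the on-device draft when the two disagree.
    with ThreadPoolExecutor(max_workers=2) as pool:
        draft_future = pool.submit(light_model, query.prompt)
        full_future = pool.submit(full_model, query.prompt)
        draft, reference = draft_future.result(), full_future.result()
    final = draft if draft == reference else reference
    return route, final


# Stand-in models for illustration only; real models would be loaded elsewhere.
def light_model(prompt: str) -> str:
    return prompt.upper()


def full_model(prompt: str) -> str:
    return prompt.upper() + "!"


simple = answer(Query("summarize my notes", complexity=0.2), light_model, full_model)
hard = answer(Query("draft a long report", complexity=0.9), light_model, full_model)
print(simple)
print(hard)
```

A real system would base the decision on measured accuracy, latency and generation length rather than a single score, and could show the on-device draft to the user while the cloud result is still pending.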

Qualcomm Technologies is well-positioned to bring efficient inferencing from the cloud to edge

As GenAI models become smaller while on-device processing capabilities continue to grow, we believe that distributing AI processing from the cloud to the edge can bring major benefits in cost, energy, performance, privacy, security and personalization. We are designing efficient AI inference solutions that leverage both the edge and the cloud. This hybrid setup places each AI workload where it runs best, in the cloud or at the edge. This allows our customers’ and partners’ smartphones, PCs, IoT devices, vehicles and data centers to deliver more intuitive, productive and efficient user experiences around the globe.

References:

1: Per https://www.emarketer.com/content/genai-user-forecast-2025, GenAI is shifting from explosive growth to mainstream adoption, with about 163 million users expected in the US by 2029.

2: AI disruption is driving innovation in on-device inference

3: Pengfei Li, Mohammad J. Islam, and Shaolei Ren. 2025. A Case Study of Environmental Footprints for Generative AI Inference: Cloud versus Edge. SIGMETRICS Perform. Eval. Rev. 53, 2 (September 2025), 21–26.

4: Samsung Galaxy S24, equipped with the Qualcomm Snapdragon 8 Gen 3.

5: Cloud servers equipped with either Nvidia A100 or L4 GPUs (hosted on Google Colab).

6: Pengfei Li, Mohammad J. Islam, and Shaolei Ren. 2025. A Case Study of Environmental Footprints for Generative AI Inference: Cloud versus Edge. SIGMETRICS Perform. Eval. Rev. 53, 2 (September 2025), 21–26.

7: The water footprint of cloud and on-device inference includes the water consumed during the inference process only. Cloud inference water footprint includes the direct water usage, primarily driven by the cooling requirements of cloud servers hosted in data centers (Scope 1) and the water usage associated with electricity generation for powering the cloud servers (Scope 2). On-device inference water footprint includes the water usage associated with electricity generation for powering the edge devices (Scope 2).

8: Pengfei Li, Mohammad J. Islam, and Shaolei Ren. 2025. A Case Study of Environmental Footprints for Generative AI Inference: Cloud versus Edge. SIGMETRICS Perform. Eval. Rev. 53, 2 (September 2025), 21–26.

About the Author

Vinesh Sukumar, VP, Product Management of AI/GenAI, Qualcomm Technologies, Inc.