Back to All
OnQ Blog

Quantization: Unlocking scalability for large language models

Find out how LLM quantization solves the challenges of making AI work on device

In the rapidly evolving world of artificial intelligence (AI), the growth of large language models (LLMs) has been nothing short of astounding. These models, which power everything from chatbots to advanced code generation tools, have grown exponentially in size and complexity. However, this growth brings significant challenges, particularly when it comes to deploying these models on devices with limited memory and processing power. This is where the innovative field of LLM quantization comes into play, offering a pathway to scale AI more effectively.

AI models are increasing rapidly in size, which makes them difficult to deploy on devices.
AI models are increasing rapidly in size, which makes them difficult to deploy on devices.

The challenge of large language models

The last years have seen a rise of flagship LLMs such as GPT-4 and Llamav3-70B. However, these models have tens to hundreds of billions of parameters and are too large to run on low-power edge devices. Smaller LLMs like Llamav3-8B phi-3 still achieve amazing results, and can run on low-power edge devices, but do require substantial memory and computational resources. This poses a problem for deployment on low-power edge devices, which have less working memory and processing power compared to cloud-based systems.

3-bit weights are practical for LLMs and are a scalable solution.
3-bit weights are practical for LLMs and are a scalable solution.

Enter quantization: A solution for efficiency

Quantization is a technique used to reduce the precision of the numbers used in computations, which in turn decreases the model size and the resources needed to run it. By converting floating-point representations to lower-bit integers, quantization significantly reduces the memory requirements and speeds up the processing time, making it feasible to run LLMs on edge devices.

Qualcomm AI Research has been creating and publishing state-of-the-art research in model efficiency for many years. Our work focuses on various quantization techniques, including post-training quantization (PTQ) and quantization-aware training (QAT). QAT incorporates quantization into the training process, allowing the model to adapt to the reduced precision and maintain higher accuracy. PTQ is a method that involves taking a pre-trained model and directly converting it into a lower precision format, which is faster and requires less expertise, but is also less accurate. How do we get the best of both worlds?

Qualcomm-image
The AI Model Efficiency Toolkit (AIMET) library provides advanced quantization and compression techniques for trained neural network models.

Supporting developers in quantizing their models

The Qualcomm Innovation Center has open-sourced the AI Model Efficiency Toolkit (AIMET) and more recently has published the comprehensive Qualcomm AI Hub portal, which allows developers to compress, quantize and visualize their models properly, as well as choose from a library of over 100 pre-optimized AI models ready to deploy with for up to four times faster inferencing. 

Some newer techniques made available to the AI community are sequential mean squared error (MSE), which finds a weight quantization scale that minimizes output activation MSE, as well as knowledge distillation, which improves QAT through training a smaller “student” model to mimic a larger “teacher” model. Qualcomm AI Research’s paper on low-rank QAT for LLMs dives deeper into this topic.

Vector quantization helps employ non-linear quantization and expands the dimensionality of the representational grid.
Vector quantization helps employ non-linear quantization and expands the dimensionality of the representational grid.

The role of vector quantization in LLMs

A promising advancement is vector quantization (VQ), which Qualcomm Technologies has been actively developing. Unlike traditional methods that quantize each parameter independently, VQ considers the joint distribution of parameters, leading to more efficient compression and less loss of information.

Qualcomm Technologies' novel approach, Generative Pretrained Transformer Vector Quantization (GPTVQ), applies vector quantization to LLMs by quantizing groups of parameters together, rather than individually. This method preserves the accuracy of the models and reduces their size more effectively than traditional quantization methods. For instance, our experiments showed that GPTVQ could achieve near-original model accuracy with significantly reduced model size, making it a game-changer for deploying powerful AI models on edge devices.

 

Implications for AI scaling

The implications of effective LLM quantization are vast. By reducing the size and computational demands of LLMs, AI applications can become more accessible, running efficiently on a wider range of devices from smartphones to IoT devices. This democratizes access to advanced AI capabilities, enabling a new wave of innovation in AI applications across industries.

Moreover, the ability to run LLMs on edge devices also addresses privacy and latency issues associated with cloud computing, as data can be processed locally without the need to be sent to a centralized server.

As AI continues to advance, the ability to scale LLMs efficiently will be crucial for the next generation of AI applications. Quantization, particularly techniques like vector quantization developed by Qualcomm AI Research, offers a promising solution to overcome the memory and latency challenges of  running AI on low-power edge devices. With ongoing research and development, the future of AI looks not only more intelligent but also more accessible and efficient.

 

Explore more topics, insights and trends in on-device generative AI within our AI on the Edge series.

Explore more topics, insights and trends in on-device generative AI within our AI on the Edge series.

Opinions expressed in the content posted here are the personal opinions of the original authors, and do not necessarily reflect those of Qualcomm Incorporated or its subsidiaries ("Qualcomm"). The content is provided for informational purposes only and is not meant to be an endorsement or representation by Qualcomm or any other party. This site may also provide links or references to non-Qualcomm sites and resources. Qualcomm makes no representations, warranties, or other commitments whatsoever about any non-Qualcomm sites or third-party resources that may be referenced, accessible from, or linked to this site.

Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc. and/or its subsidiaries. Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries. AIMET is a product of Qualcomm Innovation Center, Inc.

About the Authors
Mart van Baalen
Mart van BaalenEngineer, Senior Staff/Manager, Qualcomm Technologies Netherlands B.V.
Abhijit Khobare
Abhijit KhobareSr. Director, Software Engineering, Qualcomm Technologies, Inc.
Qualcomm relentlessly innovates to deliver intelligent computing everywhere, helping the world tackle some of its most important challenges. Our leading-edge AI, high performance, low-power computing, and unrivaled connectivity deliver proven solutions that transform major industries. At Qualcomm, we are engineering human progress.

Stay connected

Get the latest Qualcomm and industry information delivered to your inbox.

Subscribe
Manage your subscription

© Qualcomm Technologies, Inc. and/or its affiliated companies.

Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries. Qualcomm patented technologies are licensed by Qualcomm Incorporated.

Note: Certain services and materials may require you to accept additional terms and conditions before accessing or using those items.

References to "Qualcomm" may mean Qualcomm Incorporated, or subsidiaries or business units within the Qualcomm corporate structure, as applicable.

Qualcomm Incorporated includes our licensing business, QTL, and the vast majority of our patent portfolio. Qualcomm Technologies, Inc., a subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of our engineering, research and development functions, and substantially all of our products and services businesses, including our QCT semiconductor business.

Materials that are as of a specific date, including but not limited to press releases, presentations, blog posts and webcasts, may have been superseded by subsequent events or disclosures.

Nothing in these materials is an offer to sell or license any of the services or materials referenced herein.