Comparing LLM models with the Qualcomm AI Inference Suite
Sign up for Developer monthly newsletter
Join thousands of developers around the globe who receive latest news and updates from our monthly curated newsletter.
Sign upCome for support, stay for the community
Get support from experts, connect with like-minded developers, and access exclusive virtual events.
Join Developer DiscordAs a consumer using closed AI models, there are not many parameters for one to adjust other than model choice, and perhaps the verbal tone of responses. When using an open-source AI model served up through any typical inference layer, including the Qualcomm AI Inference Suite, one can choose to modify several parameters that affect the output.
Temperature, Top P, Top K, Repetition Penalty, and Max Tokens: what do they all mean and when should they be adjusted? If you were to look at the API, there are even more parameters to adjust, but these are the ones that most developer playgrounds expose through a chat UI. If one were tuning a prompt to do actual work, how should these parameters be set? Are the defaults the best ones to stick with?
In this sample, we create a way to compare two different LLM models and the ability to independently modify the parameters of each. The idea is to help understand how different models and parameters affect the output given the same prompt. You can use this to stick with the defaults and simply compare models or go further to tweak settings for your use case.
What goal do you have for output?
To figure out how to set these parameters it is perhaps best to first consider what kind of output the large language model (LLM) should produce. Most descriptions of what these parameters do focus on two extremes: 1) structured factual answers - more deterministic, or 2) creative output with diversity of expression or ideas.
So, if you are summarizing text, then perhaps you want to stick to the facts with a professional tone. If you are generating marketing copy for social media posts, then you may want to have more creative output so that there are many options to choose from. For writing code, one doesn’t want ‘creative’ code; code that works and is easy to understand is more valuable. For simple evaluations like customer sentiment analysis, the defaults may be adequate.
Definitions
Max tokens: This sets the maximum output to generate. However, since the size of a token may differ from model to model, it doesn’t strictly correlate with character count, words, or some other human-centric idea of counting. What it does is cut off the LLM content generation if set too low. For some prompts where the expected output is short due to clear instructions and a desired output format, it can save on unnecessary output token generation – possibly also reducing cost.
For example: If we use the example prompt below, then we expect the output to be a single word, so our Max Tokens value can be set low. Experimentation is key to discovering the right value for Max Tokens that meets your business use case. Why produce tokens you don’t need?
Prompt> Please evaluate the following customer feedback and answer only with positive, negative, or neutral. Only answer positive, negative, or neutral in lowercase.
Feedback: ‘We loved this hotel and will be coming back again!’
LLM response> positive
Temperature: Temperature is often described as controlling randomness of token selection. In practice that means if it is set to zero, it will always pick the most likely next token - i.e., a factual answer based on training. If set higher, then it introduces randomness, meaning that the generated text should be more ‘creative.’ Setting this at 0.7 is common for creative tasks like ideation.
An example of a prompt where creativity is valued may be something like, “You are a copywriting and marketing expert. Come up with 20 ideas to showcase my new widget. Here are the facts about the widget and its uses: fact 1, 2, etc.”
When comparing the output of the same model with a higher Temperature, you will see that there are more ideas with a wider variety of words and creativity. Conversely, setting Temperature to zero combining with a low Top P (explained below) results in a narrow set of recommendations that tend to repeat facts that were in the prompt.
Repetition Penalty: Behind the scenes, this creates a penalty for repeating tokens or phrases. Setting too much below one will most likely result in the same phrase repeating over and over. (possibly not related to the prompt). Set too high, the answer might seem to be answering like a person with Alzheimer’s. It literally forgets one thread of reasoning and goes off in another direction - sometimes more than once. Small modifications plus or minus a tenth from 1.1 can make a difference - but most likely it should remain at the default of 1.1.
Top P: Top P sampling, also called nucleus sampling, is a decoding strategy used by LLMs to control randomness in the created text. Instead of picking the next word in a sequence from the entire vocabulary, it sorts possibilities and then chooses a subset according to the cumulative probability of fitting in the Top P percentage of options. Thus, a higher value means more options (words) and more creativity, while a lower Top P is more likely to pick from a smaller set of 'safe reliable' choices for the next word.
The idea is similar to how the Temperature setting works. For the most creative response, a high Temperature with a high Top P can be combined. That way, there will be variability in how words are chosen relative to the training data set.
Top K: Top K is very similar in effect to Top P in that it limits the number of actual tokens to chose from for the next token. If Top P uses probability sorting, Top K uses an absolute number of tokens to limit the LLM's choice of next token(s) to output. Again, a higher Top K value means more creativity.
Typical combinations
To sum it up for typical scenarios:
- Higher Temperature, Top P, Top K: produces a more creative output with diverse word choice
- Lower Temperature, Top P, Top K: good for more deterministic and factual output
- Repetition Penalty: adjust to slightly higher for code output to reduce loops
All of these settings can affect the output, so if you are brainstorming – playing around with them and regenerating output from the same prompt can produce more possibilities for you. This is especially useful if your goal is having lots of choices rather than a one-shot answer.
A tool for comparing
One shortcoming of typical LLM chat interfaces is that you can only generate one response at a time. And when you modify these parameters, the output doesn’t capture which combination of parameters you used to generate the response.
To better understand and visualize their effect, we have created a code sample at this GitHub repo that can compare responses across not only parameters but also LLMs from any provider endpoint supporting the standard OpenAI API chat interface. For our sample, we are hitting the Qualcomm AI Inference Suite powered by Qualcomm AI accelerators and hosted by our partner Cirrascale.
Try it out yourself
Using the sample code at this GitHub repo you can test it out for yourself. Try changing the endpoint configuration file to sample LLMs hosted by different companies. Or change the prompt to tasks where you want creativity or where you want a defined answer.
Use the settings of Configuration A and B differently to see how changing them affects the output or to just compare two different LLMs.
After using this sample, let us know over on the Qualcomm Cloud AI Discord channel what you’ve created for your own scenarios.
Be sure to sign up for free tokens and retrieve your API key from our partner Cirrascale.
Explore other topics in the Cloud AI blog series.

