One of the hottest gift ideas this holiday season is the virtual assistant, specifically one with a voice user interface (VUI).
These handy devices have become increasingly common in our daily lives since Siri was first introduced in 2011. Around 700 million people are using AI personal assistants, and the market is expected to grow to almost 2 billion by 2021. There are multiple solutions out there, from Siri to Google Assistant to Amazon Alexa and Microsoft Cortana. Samsung has recently launched its Bixby assistant, while Facebook is expected to bring its own virtual assistant, simply called “M”, to commercialization next year.
As a developer, it is important to understand how these devices work and how to take advantage of their capabilities. Internally, they’re powered by Bluetooth and Wi-Fi modules like the Qualcomm QCA9377-3 and processors such as the Qualcomm Snapdragon Mobile Platform. In this blog, we’re going to dive into how it all fits together.
Conversational and command-based interactions
A conversational interface is a user interface that mimics having a conversation with a human. Personal assistants come in two flavors: chatbots, which are text-based interactions, and voice user interfaces (or voice-activated assistants) like the commercial products mentioned earlier. Voice-activated assistants are typically command-based AI interactions: you ‘wake it up’ and tell it what to do.
Voice-activated assistants are ideal for day-to-day tasks such as:
- Fact finding: Internet searches to find information, time of day and weather queries.
- Tasking: Setting alarms, sending messages, playing music & video, ordering things online, smart home coordination.
- Information gathering: Call centers collecting user information, healthcare providing initial diagnosis.
- Training: Learning a new language by conversing with an AI teacher.
Using a VUI bypasses the need for a keyboard, screen, and spellchecking, which also makes it useful for hands-free communication as well as for accessibility needs.
The hardware components for voice-based assistants include speakers and a microphone, Bluetooth and Wi-Fi modules, and standard computer architecture (CPU, RAM). Although there’s a lot of technology in the device, the real brains usually reside in the cloud.
The easiest way to start writing apps that take advantage of a VUI is to use a library such as Dialogflow, which has integrations for all of the major players. If you want to delve deeper into the brains, you can learn more about Natural Language Processing and machine learning in general.
To be effective with this technology as a developer and a designer, it is important to understand the complete command interaction, which works as follows:
- The virtual assistant is “woken up” using a trigger word (“Ok Google”, “Hey Siri”) to ensure that it only starts acting upon your command.
- Audio is recorded on the device, compressed, and streamed to the cloud over Wi-Fi. Noise reduction algorithms are often applied to the recorded audio so that the commands are more easily interpreted by the cloud processing.
- The audio is turned into text commands using a proprietary voice-to-text platform. Analog sound waves are converted to digital data by sampling the analog signal at a specified frequency. The digital data is analyzed to determine where the English phonemes (“bb”, “oo”, “sh”, etc.) occur. Once the phonemes are identified, a statistical modelling algorithm such as the Hidden Markov Model is used to determine the likelihood of a specific word.
- The text is processed using Natural Language Processing (NLP) to determine the desired action. The algorithm first uses part-of-speech tagging to determine which words are adjectives, verbs, nouns, etc. It combines this tagging with statistical machine learning models to deduce the meaning of the sentence.
- If the action requires further searches, then they are performed at this time. For example, “Hey Siri, what is the Snapdragon mobile platform?” would require an internet search to return the information. If the command is something like “Ok Google, send mom a message” then the command data (action: send message, recipient: mom) is sent back to the virtual assistant.
- A reply is constructed in the cloud, and the desired output words are retrieved from a database of speech samples. These words are stitched together to form a sentence and returned to the hardware to be played back to the user.
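The phoneme-to-word step above can be illustrated with a toy Hidden Markov Model. This is only a sketch of the idea, not a real recognizer: the two-word vocabulary, the states, and every probability below are invented for the example, whereas a production system learns them from large amounts of training speech.

```python
# Toy illustration of scoring candidate words against a phoneme sequence
# with a Hidden Markov Model (HMM). All probabilities are made up for
# this example; a real recognizer learns them from hours of speech data.

def sequence_likelihood(phonemes, model):
    """Probability that one word's HMM emitted the observed phonemes.

    Uses the forward algorithm: alpha[s] is the probability of the
    phoneme prefix seen so far, ending in hidden state s.
    """
    states = model["states"]
    alpha = {s: model["start"][s] * model["emit"][s].get(phonemes[0], 0.0)
             for s in states}
    for ph in phonemes[1:]:
        alpha = {
            s: sum(alpha[prev] * model["trans"][prev][s] for prev in states)
               * model["emit"][s].get(ph, 0.0)
            for s in states
        }
    return sum(alpha.values())

# Two hypothetical word models over a tiny phoneme alphabet.
SHOE = {
    "states": ["s1", "s2"],
    "start": {"s1": 1.0, "s2": 0.0},
    "trans": {"s1": {"s1": 0.2, "s2": 0.8}, "s2": {"s1": 0.0, "s2": 1.0}},
    "emit": {"s1": {"sh": 0.9, "oo": 0.1}, "s2": {"oo": 0.9, "sh": 0.1}},
}
BOO = {
    "states": ["s1", "s2"],
    "start": {"s1": 1.0, "s2": 0.0},
    "trans": {"s1": {"s1": 0.2, "s2": 0.8}, "s2": {"s1": 0.0, "s2": 1.0}},
    "emit": {"s1": {"bb": 0.9, "oo": 0.1}, "s2": {"oo": 0.9, "bb": 0.1}},
}

heard = ["sh", "oo"]
scores = {word: sequence_likelihood(heard, model)
          for word, model in [("shoe", SHOE), ("boo", BOO)]}
best = max(scores, key=scores.get)  # "shoe": its model explains ["sh", "oo"] best
```

The same scoring is run over every candidate word (or, in practice, over word sequences constrained by a language model), and the highest-likelihood hypothesis becomes the transcribed text.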
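Putting the remaining steps together, here is a heavily simplified sketch of the command loop once the audio has already been transcribed to text. The wake words, the keyword rules standing in for real NLP, and the canned reply fragments are all invented for illustration; a commercial assistant replaces each of these with trained models and a large response system.

```python
# A heavily simplified command loop: wake-word check, intent extraction,
# and a "concatenative" reply stitched from canned fragments. The rules
# here are toy stand-ins for the trained models a real assistant uses.

WAKE_WORDS = ("ok google", "hey siri")

def extract_command(transcript):
    """Return an (action, argument) pair, or None if no wake word was heard."""
    text = transcript.lower().strip()
    wake = next((w for w in WAKE_WORDS if text.startswith(w)), None)
    if wake is None:
        return None  # no trigger word: the device keeps ignoring the audio
    body = text[len(wake):].strip(" ,")
    # Crude keyword matching in place of part-of-speech tagging + NLP.
    if body.startswith("send") and "message" in body:
        recipient = body.split()[1]          # "send mom a message" -> "mom"
        return ("send_message", recipient)
    if body.startswith(("what is", "what's")):
        topic = body.split(" is ", 1)[-1].rstrip("?")
        return ("web_search", topic)
    return ("unknown", body)

def speak(action, argument):
    """Stitch a reply from pre-recorded fragments (plain strings here)."""
    fragments = {
        "send_message": ["ok,", "sending a message to", argument],
        "web_search": ["here is what i found about", argument],
        "unknown": ["sorry,", "i did not understand that"],
    }
    return " ".join(fragments[action])

command = extract_command("Ok Google, send mom a message")
# command == ("send_message", "mom")
reply = speak(*command)
# reply == "ok, sending a message to mom"
```

In the real pipeline the command data (action: send message, recipient: mom) would be sent back to the device, and the reply audio would be assembled in the cloud from actual speech samples rather than strings.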
What’ll be the next talk of the town?
Now that you know how voice-activated assistants work, you can start building your own products. Why not try making a voice-powered RC car, or maybe a Christmas tree that responds to your child’s commands? With the power of voice recognition and the latest Qualcomm Technologies, including our Bluetooth and Wi-Fi modules as well as our Qualcomm 3D Audio Tools, you can treat yourself to some fun new developer challenges over the holidays.