May 17, 2021
Qualcomm products mentioned within this post are offered by Qualcomm Technologies, Inc. and/or its subsidiaries.
Sound recognition is quickly becoming established as a core AI technology in consumer tech, alongside image and voice recognition. As with these other applications of AI, sound recognition opens up new opportunities for smartphone OEMs to enrich the consumer experience by offering innovative features that improve convenience or create opportunities for consumers to express themselves.
During the 2020 Snapdragon Tech Summit, we announced that the 2nd Generation Qualcomm Sensing Hub on the Qualcomm Snapdragon 888 5G Mobile Platform was pre-validated to run Audio Analytic’s Acoustic Scene Recognition technology in always-on, low-power mode alongside the Qualcomm Aqstic audio codec, which supports all major voice assistants.
But what exactly is sound recognition, what impact could it have on the smartphone experience and how can it bring value to the consumer? I spoke to Neil Cooper, vice president of marketing at Audio Analytic, to find out more.
Q: What is sound recognition and what are its applications?
When we talk about sound recognition, we are talking about the ability to accurately recognize sounds beyond just speech and music. Although consumers are familiar with music recognition and voice recognition, those cover just two narrow categories of sound. We developed a thorough categorization of sounds (biophonic, geophonic, and anthrophonic) that goes well beyond music and voice.
Sound recognition brings “the sense of hearing” to smartphones, and it enables new use-cases:
- Audio event detection – where you are looking to detect the presence of a key sound and alert somebody or modify device settings, such as recognizing a knock at the door and then alerting the user if they have their headphones on while gaming.
- Scene recognition – where the combination of sounds and the acoustic characteristics of a space provides contextual information about your location, such as a chaotic coffee shop or a calm home office. The smartphone’s user interface can then adapt automatically, for instance by changing notification settings, among other UI aspects.
- Sound tagging – where you are identifying whether audio contains certain sounds, such as the video you take on your mobile phone that features a person laughing. This enables you to find that video in the future based on the sounds it contains.
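The three tasks differ mainly in what the system does with a recognized sound. As a rough sketch (the labels, thresholds, and actions below are illustrative assumptions, not Audio Analytic's actual API), each task can be thought of as a different consumer of classifier output:

```python
# Hypothetical illustration of the three sound-recognition tasks.
# All class names, thresholds, and actions are made up for this sketch.

EVENT_ACTIONS = {
    "knock": "notify_user",          # audio event detection: sound -> alert
    "smoke_alarm": "urgent_alert",
}

SCENE_PROFILES = {
    "busy_cafe": {"ringer": "loud", "vibrate": True},     # scene recognition:
    "calm_office": {"ringer": "silent", "vibrate": True}, # context -> settings
}

def handle_event(label: str, confidence: float, threshold: float = 0.8):
    """Audio event detection: act on a confidently detected key sound."""
    if confidence >= threshold and label in EVENT_ACTIONS:
        return EVENT_ACTIONS[label]
    return None

def apply_scene(scene: str):
    """Scene recognition: adapt notification settings to the acoustic scene."""
    return SCENE_PROFILES.get(scene)

def tag_media(detections):
    """Sound tagging: attach detected labels to a recording for later search."""
    return sorted({label for _, label in detections})

print(handle_event("knock", 0.92))   # -> notify_user
print(apply_scene("busy_cafe"))      # -> {'ringer': 'loud', 'vibrate': True}
print(tag_media([(1.5, "laughter"), (3.2, "dog_bark"), (4.0, "laughter")]))
```

In practice the interesting engineering lives inside the classifier itself; the point here is only that the same inference output can drive alerts, UI adaptation, or searchable metadata.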
Our vision is that all consumer products will have a sense of hearing. This includes smartphones, smart speakers, hearables, doorbells, smart home cameras, and many other categories.
By embedding our sound recognition inference platforms (ai3 and ai3-nano) into products, we provide important contextual information, giving devices the ability to react to the world around them. This helps satisfy consumer needs in areas such as entertainment, safety, security, well-being, convenience, accessibility, and communication. In all cases, this creates positive and valuable consumer experiences.
Q: You talk about sound recognition having a positive impact on consumer experiences and touch on the three sound recognition tasks. What impact will it have on how we interact with smartphones and how they make sense of the world around us? Can you share some examples?
Sound recognition will have a massive impact on how smartphones interact with consumers and the world around us. This embedded, edge-based AI capability empowers OEMs to deliver a wave of innovative capabilities that consumers will love. Here are a few examples of what I mean:
Acoustic Scene Recognition - No more embarrassing moments when you’ve left a chaotic coffee shop and returned to the calm office, only for your phone to ring at maximum volume at the most inappropriate moment. And on the flip side, no more missed calls because your phone is still on vibrate when you’re in a busy bar. This technology, which has been optimized for use with the Snapdragon 888 SoC, is best seen as an enabling technology. Beyond intelligently adjusting notification, UI, and call settings, there are many other innovative things this sense of hearing could support. Because it runs on the Qualcomm Sensing Hub and not in the cloud, devices can react 24/7 alongside wake word detection, giving consumers the benefits of both technologies without draining battery life.
Media Tagging - Automatically tag the audio content of videos and photos to enable creative editing, social sharing, or easy retrieval. Searching for videos or pictures is not a smooth experience for consumers now that we’ve amassed thousands of them. Quickly finding that special moment where your child was laughing on holiday is incredibly attractive to consumers who want to relive and share these precious moments. Media Tagging also supports creative editing and social sharing that take advantage of sound-triggered effects and filters; for example, applying a filter when a guitar is played. This enables consumers to express themselves creatively on the fly.
Sound + AR Gaming – Video calls entered all of our lives in 2020, and users can have fun by applying extra features to their faces. However, the process of finding and applying a filter or effect interrupts the user experience. By combining sound recognition and augmented reality, consumers can trigger effects and filters through the sounds they make. This introduces spontaneous silliness and helps people to share enjoyable moments, regardless of their age or location.
Accessibility – According to the National Institute on Deafness and Other Communication Disorders (NIDCD), approximately 15% of American adults report some trouble hearing. In addition, thanks to the rapid growth of mobile gaming, the success of the true wireless earbud form factor, and the development of active noise-canceling technology, we are spending increasing proportions of our day isolated from the world around us. When we wear headphones with noise cancellation, our sense of hearing is (temporarily) impaired. In this scenario, there is a clear role for smartphones to be that extra pair of ears, alerting users to sounds of danger, like smoke and CO alarms, or offering a helping hand around the house by alerting them to a knock at the door.
Q: Why is all of this done locally on the device?
Consumers demand that AI respects their privacy, especially with something as sensitive as sound. Our technology was designed from the outset to run at the edge of the network, on-device, so sounds never have to leave a consumer’s device. Respecting consumer privacy means that users can embrace the benefits without any compromises. This approach has other advantages for OEMs as well, because they don’t need expensive cloud infrastructure to analyze and store sounds 24/7.
Q: That is an exciting range of applications and use cases. How do you train models to recognize such a broad number of sounds? Can you briefly explain some of the key technology building blocks?
When it comes to training and evaluating sound recognition systems, we must expose them to realistic real-world data. Quantity matters, as with all aspects of machine learning, but so do relevance and diversity. There is no appropriate, commercially viable public data source, and you can’t simply download or extract audio from the internet due to a range of legal, ethical, and technical limitations, which we’ve previously explained in a whitepaper. So we built our own purpose-built dataset, Alexandria, which contains 30 million labeled sound recordings across 1,000 sound classes.
When it comes to model training and evaluation, we’ve built AuditoryNET, our collection of specialized DNNs, frameworks, and tools for the event, scene, and tagging ML tasks in sound recognition. Within our ML training and evaluation toolbox are tools such as our patented loss function, which is built around the specific challenges of sound recognition, and our range of sound-recognition-specific model compression tools. In addition, our seminal work on evaluating sound recognition models has led us to publish our research on the Polyphonic Sound Detection Score (PSDS), which has become the industry’s standard tool for evaluating sound recognition systems.
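To give a flavor of what PSDS measures: rather than requiring a detection to match a ground-truth event exactly, it uses intersection-based criteria for counting a detection as correct, then aggregates error rates across many operating points. The toy sketch below illustrates only the overlap-matching idea in a simplified, single-detection form (the real metric pools intersections across detections and classes and computes an area under a ROC-like curve), and is not Audio Analytic's implementation:

```python
# Toy sketch of intersection-based matching in the spirit of PSDS.
# Simplified: real PSDS pools intersections across all detections and
# aggregates TP/FP rates over many operating points; this checks one pair.

def overlap(a, b):
    """Length of the intersection of two (start, end) intervals, in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def matches(detection, ground_truth, dtc=0.5, gtc=0.5):
    """A detection counts if enough of it lies on the ground-truth event (DTC)
    and enough of the ground-truth event is covered by it (GTC)."""
    inter = overlap(detection, ground_truth)
    det_len = detection[1] - detection[0]
    gt_len = ground_truth[1] - ground_truth[0]
    return inter / det_len >= dtc and inter / gt_len >= gtc

# A detection from 1.0s to 3.0s against a ground-truth event from 1.5s to 3.5s:
print(matches((1.0, 3.0), (1.5, 3.5)))  # True: the 1.5s overlap covers 75% of each
```

The appeal of this style of matching is that it rewards systems for catching the event rather than for reproducing its exact onset and offset timestamps.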
Q: How can you embed an entire AI sound recognition system in a chip? What is the secret?
To be this compact, you have to optimize the entire ML pipeline for this objective – it isn’t just a case of compressing the software at the end. You have to collect the right type of data, you need network architectures built for the job, and you need training and compression tools appropriate to sound recognition tasks. You also need compact software and flexible inference engine architectures. To give you an example, ai3-nano running Acoustic Scene Recognition on the Snapdragon 888 5G Mobile Platform requires just 100kB of memory and around 1mA of current.
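One standard compression step that helps reach footprints of this order is post-training quantization, where float32 weights are stored as 8-bit integers plus a scale factor, for an immediate 4x size reduction. The sketch below is a generic illustration of that technique, not Audio Analytic's toolchain:

```python
# Generic post-training 8-bit quantization sketch (not Audio Analytic's tools).
# float32 weights -> int8 values plus one per-tensor scale factor.
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 with a symmetric per-tensor scale."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int8(w)

print(f"float32: {w.nbytes} bytes, int8: {q.nbytes} bytes")  # 4x smaller
print(f"max abs error: {np.abs(w - dequantize(q, scale)).max():.4f}")
```

Quantization alone doesn't get a full system down to 100kB; as the answer notes, the data, architecture, training tools, and inference engine all have to be designed for compactness from the start.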
Q: Where can our readers go for more information?
If you want to find out more about the applications for sound recognition technology on smartphones, you can visit Audio Analytic’s website.