The Whisper-Base-En automatic speech recognition (ASR) model transcribes spoken English into written text. It performs robustly in realistic, noisy environments, which makes it well suited to real-world applications. It is also suited to long-form transcription and accepts audio clips up to 30 seconds long.
Objective
The objective is to implement and evaluate Qualcomm® AI Hub’s Whisper ASR model on the Qualcomm® Robotics RB5 platform, so that spoken language is transcribed accurately into written text in real-world, noisy environments. The project aims to integrate the model into a user-friendly application, test its performance, optimize it for better accuracy, and demonstrate its practical use in various scenarios.
| Equipment | Description |
| --- | --- |
| Qualcomm Robotics RB5 | Qualcomm Robotics RB5 Development Kit |
| Power adapter | 12 V, 2500 mA, as required by the 96Boards specification |
| USB to Micro USB cable | For the serial console interface, to view logs |
| USB to USB Type-C cable | For connecting to the board's USB 3.0 Type-C port, for flashing images, adb, and fastboot |
Source code (GitHub): https://github.com/globaledgesoft/Whisper-ASR-Model-Compilation-and-Integration-on-RB5.git
Qualcomm Robotics RB5 Development Kit bring up
Introduction:
- Whisper-Base-En is an automatic speech recognition (ASR) model for English transcription. It performs robustly in realistic, noisy environments, which makes it well suited to real-world applications.
- Identify any limitations or challenges encountered during the implementation and testing phases.
- Explore potential optimizations to improve transcription accuracy and processing speed.
- Investigate techniques to enhance the model's performance in particularly noisy or challenging environments.
Technical Details:
- Model checkpoint: base.en
- Input resolution: 80x3000 (30 seconds of audio)
- Mean decoded sequence length: 112 tokens
- Number of parameters (WhisperEncoder): 23.7M
- Model size (WhisperEncoder): 90.6 MB
- Number of parameters (WhisperDecoder): 48.6M
- Model size (WhisperDecoder): 186 MB
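The 80x3000 input resolution follows from simple arithmetic on a 30-second clip. The sketch below assumes the front-end constants of the reference Whisper implementation (16 kHz sample rate, a hop of 160 samples, 80 mel bins); these values and the helper names are illustrative assumptions, not taken from this project's scripts, which may prepare their input differently.

```python
# Assumed constants from the reference Whisper front end (not from this project).
SAMPLE_RATE = 16_000   # samples per second
HOP_LENGTH = 160       # samples between successive spectrogram frames
N_MELS = 80            # mel frequency bins per frame
CHUNK_SECONDS = 30     # length of one model input window


def feature_shape(seconds=CHUNK_SECONDS):
    """Return (mel bins, frames) for a clip of the given length."""
    n_samples = seconds * SAMPLE_RATE
    n_frames = n_samples // HOP_LENGTH
    return (N_MELS, n_frames)


def pad_or_trim(samples, seconds=CHUNK_SECONDS):
    """Zero-pad or cut a list of samples to exactly one input window."""
    target = seconds * SAMPLE_RATE
    if len(samples) >= target:
        return list(samples[:target])
    return list(samples) + [0.0] * (target - len(samples))


print(feature_shape())  # (80, 3000)
```

Shorter recordings are padded with silence and longer ones are cut, so every clip maps to the same fixed-size input.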
Whisper-Base-En TFLite Model Inference on Qualcomm Robotics RB5
Prerequisites:
- Ubuntu 20.04
- Conda Environment
- Python 3.8
- TFLite
- Qualcomm Robotics RB5
Steps to Execute:
Step 1: Install Conda on WSL2 by using this link.
- After installation, create the conda environment:
$ conda create --name <env_name> python=3.8
- To list the conda environments:
$ conda env list
- To activate the conda environment:
$ conda activate <env_name>
Step 2: Install the required dependencies for Qualcomm Robotics RB5 to run the TFLite inference.
- Install TFLite Runtime Library on the Qualcomm Robotics RB5.
- Install OpenCV-Python on the Qualcomm Robotics RB5.
Step 3: From this link, download the whisper_base.tflite model and save it in the model directory.
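Before running inference, it can help to confirm that the model file is where the scripts expect it and to see which tensors it takes. The sketch below is one possible check, not part of the project's code: `find_model` and `describe_model` are hypothetical helper names, and the `model/whisper_base.tflite` location is assumed from Step 3. It uses the `Interpreter` class from the `tflite_runtime` package, imported lazily so that the path check also works where the runtime is not installed.

```python
from pathlib import Path


def find_model(model_dir="model", name="whisper_base.tflite"):
    """Return the model path, or raise if the file from Step 3 is missing."""
    path = Path(model_dir) / name
    if not path.is_file():
        raise FileNotFoundError(f"Expected model file at {path}")
    return path


def describe_model(path):
    """Print name, shape, and dtype of each input and output tensor."""
    # Imported here so find_model() works without tflite_runtime installed.
    from tflite_runtime.interpreter import Interpreter

    interpreter = Interpreter(model_path=str(path))
    interpreter.allocate_tensors()
    for label, details in (("input", interpreter.get_input_details()),
                           ("output", interpreter.get_output_details())):
        for tensor in details:
            print(label, tensor["name"], tensor["shape"], tensor["dtype"])
```

On the board, calling `describe_model(find_model())` from the project directory prints the tensor layout the downloaded model expects.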
Step 4: Run Whisper-Base-En TFLite model inference on Qualcomm Robotics RB5 with a recorded audio file (audio.wav) as input.
$ python3 inference.py
Step 5: Run Whisper-Base-En TFLite model inference on Qualcomm Robotics RB5 with audio input from an external microphone instead of a recorded file.
$ python3 inference_mic.py
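Whether the audio comes from a recorded file or a microphone, the samples have to be in a form the model can consume. As a rough illustration of the recorded-file case, the sketch below reads a 16-bit PCM file such as audio.wav using only the Python standard library. The function name `load_wav_mono16` and the 16 kHz expectation are assumptions for illustration; the project's inference.py may load audio differently.

```python
import struct
import wave

EXPECTED_RATE = 16_000  # assumed target rate; check against the actual script


def load_wav_mono16(path):
    """Read a 16-bit PCM WAV file; return (float samples in [-1, 1), rate)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("Only 16-bit PCM WAV files are handled here")
        rate = wf.getframerate()
        channels = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())
    # Unpack little-endian signed 16-bit integers (interleaved by channel).
    ints = struct.unpack("<%dh" % (len(raw) // 2), raw)
    # Keep only the first channel when the file has more than one.
    mono = ints[::channels]
    samples = [s / 32768.0 for s in mono]
    return samples, rate
```

A caller could compare the returned rate with `EXPECTED_RATE` and resample or re-record when they differ, since feeding audio at the wrong rate usually degrades transcription.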
