Abstract: Keyword spotting (KWS) is the task of recognizing spoken command words from a database. With recent applications in human-machine interaction, KWS systems require real-time performance, for which edge computing is a preferable option. For a KWS system to run fast and in real time, a low-complexity yet highly accurate AI model is mandatory. In this paper, we propose a comprehensive voice command recognition system design and its hardware implementation. The proposed AI model is SpectroNet-based and uses an efficient hybrid CNN-LSTM architecture with low complexity. Jetson Xavier NX is chosen as the edge device because of its strong computational power for an embedded device. The implementation results show that the proposed method achieves good accuracy, with no accuracy drop between the model implemented on a PC and on the Jetson Xavier. However, the inference time is quite high, at 180 ms/step. To improve the speed of the system, the TensorRT library is used to further optimize the model. The optimization is found to be effective, reducing the total number of operations performed in SpectroNet by 59.35% when FP32 precision is used and by 59.63% when FP16 precision is used. The model is also sped up by 45% in FP32 precision mode and by 62% in FP16 precision mode. However, there is a slight accuracy drop of 2.68% in FP32 precision mode and 4.84% in FP16 precision mode. This slight drop in accuracy is considered negligible compared to the performance boost that TensorRT gives. The work is useful for intelligent control systems such as smart vehicles, smartphones, computers, and smart communications.
