Abstract: Keyword Spotting (KWS) is the task of recognizing spoken command words from a database. With its recent applications in human-machine interaction, KWS systems require real-time performance, for which edge computing is a preferable option. To run fast and in real time, a KWS system needs an AI model that is both low in complexity and highly accurate. In this paper, we propose a comprehensive voice command recognition system design and its hardware implementation. The proposed AI model is based on SpectroNet, an efficient hybrid CNN-LSTM architecture with low complexity. Jetson Xavier NX is chosen as the edge device because of its strong computational power for an embedded device. The implementation results show that the proposed method achieves good accuracy, as indicated by no accuracy drop between the model implemented on a PC and on the Jetson Xavier. However, the inference time is quite high at 180 ms/step. To improve the speed of the system, the TensorRT library is used to further optimize the model. The optimization is found to be effective, reducing the total number of operations performed in SpectroNet by 59.35% when FP32 precision is used and by 59.63% when FP16 precision is used. The model is also sped up by 45% in FP32 precision mode and by 62% in FP16 precision mode. However, there is a slight accuracy drop of 2.68% in FP32 precision mode and 4.84% in FP16 precision mode. This drop is considered negligible compared to the performance boost that TensorRT gives. The work is useful for intelligent control systems such as smart vehicles, smartphones, computers, and smart communications.
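
The abstract gives a baseline of 180 ms/step and speedups of 45% (FP32) and 62% (FP16), but it does not say whether "sped up by" means a reduction in inference time or a gain in throughput. The sketch below is only a reading aid: it works out the per-step latency under both interpretations. The function names are hypothetical, and the resulting latencies are derived arithmetic, not figures reported by the authors.

```python
# Baseline inference time reported in the abstract (ms per step).
BASELINE_MS = 180.0

# Speedup percentages reported in the abstract for each TensorRT precision mode.
SPEEDUP_PCT = {"FP32": 45.0, "FP16": 62.0}


def latency_if_time_reduced(baseline_ms: float, pct: float) -> float:
    """Interpretation A (assumption): inference time drops by pct percent."""
    return baseline_ms * (1.0 - pct / 100.0)


def latency_if_throughput_gained(baseline_ms: float, pct: float) -> float:
    """Interpretation B (assumption): throughput rises by pct percent."""
    return baseline_ms / (1.0 + pct / 100.0)


if __name__ == "__main__":
    for mode, pct in SPEEDUP_PCT.items():
        a = latency_if_time_reduced(BASELINE_MS, pct)
        b = latency_if_throughput_gained(BASELINE_MS, pct)
        print(f"{mode}: {a:.1f} ms/step (time reduced) "
              f"or {b:.1f} ms/step (throughput gained)")
```

Under either reading, FP16 gives the lower latency, at the cost of the larger accuracy drop (4.84% versus 2.68%) stated in the abstract.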

Robust Small-Footprint Keyword Spotting Using Sequence-To-Sequence Model With Connectionist Temporal Classifier

Small-footprint Keyword Spotting with Graph Convolutional Network

Compact Feedforward Sequential Memory Networks For Small-Footprint Keyword Spotting

Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting

Frequency & Channel Attention Network for Small Footprint Noisy Spoken Keyword Spotting

A Spiking Neural Network System for Robust Sequence Recognition

Keyword Spotting Based on Syllable Confusion Network.

Hello Edge: Keyword Spotting on Microcontrollers

Online Continual Learning in Keyword Spotting for Low-Resource Devices via Pooling High-Order Temporal Statistics

Depthwise Separable Convolutional ResNet with Squeeze-and-Excitation Blocks for Small-footprint Keyword Spotting

End-to-end keywords spotting based on connectionist temporal classification for Mandarin

Exploring Representation Learning for Small-Footprint Keyword Spotting

NS-KWS: joint optimization of near-sensor processing architecture and low-precision GRU for always-on keyword spotting

9.1 μW keyword spotting processor based on optimized MFCC and small‐footprint TENet in 28‐nm CMOS

A Multi-Spike Approach For Robust Sound Recognition

Query-by-Example Keyword Spotting Using Spectral-Temporal Graph Attentive Pooling and Multi-Task Learning

Global-Local Convolution with Spiking Neural Networks for Energy-efficient Keyword Spotting

Keyword Spotting for Hearing Assistive Devices Robust to External Speakers

DCCRN-KWS: an audio bias based model for noise robust small-footprint keyword spotting

Efficient Real-Time Smart Keyword Spotting Using Spectrogram-Based Hybrid CNN-LSTM for Edge System

TDT-KWS: Fast And Accurate Keyword Spotting Using Token-and-duration Transducer