Abstract: Although keyword spotting (KWS) technologies have been applied successfully in some applications, most KWS systems lack noise robustness in real-world environments. Audio-visual keyword spotting (AVKWS), which uses both acoustic and visual information, addresses this problem by letting the two modalities complement each other. Most existing audio-visual speech recognition (AVSR) systems use geometric features as visual features, which rely heavily on accurate and reliable detection and tracking of facial feature points. To avoid this weakness of geometric features, this paper proposes an appearance-based discriminative local spatial-temporal descriptor (disCLBP-TOP) designed to extract robust and discriminative patterns of interest. In addition, a parallel two-step recognition scheme based on acoustic and visual keyword searching and re-scoring is used, so that each modality compensates for the other under different noise conditions. Adaptive weights for decision fusion are generated by a sigmoid function of the reliabilities of the two modalities, which lets the fusion adapt to various noise conditions. Experiments show that the proposed parallel AVKWS strategy based on decision fusion significantly improves noise robustness and performs better than a feature-fusion-based audio-visual spotter. Additionally, disCLBP-TOP is more competitive than CLBP-TOP.
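The reliability-driven decision fusion mentioned in the abstract can be illustrated with a minimal sketch. The abstract states only that the weights come from a sigmoid function of the two modalities' reliabilities; the exact form below (a sigmoid of the reliability difference, with hypothetical `slope` and `offset` parameters, and a convex combination of the two scores) is an assumption made for illustration, not the paper's formula.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def audio_weight(audio_reliability: float, visual_reliability: float,
                 slope: float = 1.0, offset: float = 0.0) -> float:
    """Weight in (0, 1) given to the acoustic stream.

    Assumption: the weight depends on the reliability difference.
    `slope` and `offset` are hypothetical tuning parameters.
    """
    return sigmoid(slope * (audio_reliability - visual_reliability) + offset)


def fuse_scores(audio_score: float, visual_score: float,
                audio_reliability: float, visual_reliability: float,
                slope: float = 1.0, offset: float = 0.0) -> float:
    """Convex combination of the two keyword confidence scores."""
    w = audio_weight(audio_reliability, visual_reliability, slope, offset)
    return w * audio_score + (1.0 - w) * visual_score


if __name__ == "__main__":
    # Noisy audio (low acoustic reliability): the visual score dominates.
    print(fuse_scores(audio_score=0.2, visual_score=0.9,
                      audio_reliability=0.1, visual_reliability=0.8,
                      slope=5.0))
```

With this form, equally reliable streams get equal weight, and the acoustic weight shrinks smoothly as the audio becomes less reliable than the video.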

Focal Loss And Double-Edge-Triggered Detector For Robust Small-Footprint Keyword Spotting

CaTT-KWS: A Multi-stage Customized Keyword Spotting Framework Based on Cascaded Transducer-Transformer.

Text Adaptive Detection for Customizable Keyword Spotting.

TDT-KWS: Fast And Accurate Keyword Spotting Using Token-and-duration Transducer

Keyword Spotting Based on Hypothesis Boundary Realignment and State-Level Confidence Weighting

Keyword Spotting for Hearing Assistive Devices Robust to External Speakers

Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting

Text-Dependent Speech Enhancement for Small-Footprint Robust Keyword Detection

A Depthwise Separable Convolution Neural Network for Small-footprint Keyword Spotting Using Approximate MAC Unit and Streaming Convolution Reuse

A Monaural Speech Enhancement Method for Robust Small-Footprint Keyword Spotting.

Audio-visual Keyword Spotting for Mandarin Based on Discriminative Local Spatial-Temporal Descriptors.

Re-Weighted Interval Loss for Handling Data Imbalance Problem of End-to-End Keyword Spotting.

U2-KWS: Unified Two-pass Open-vocabulary Keyword Spotting with Keyword Bias

A Two-Step Keyword Spotting Method Based on Context-Dependent a Posteriori Probability

Multi-class AUC Optimization for Robust Small-footprint Keyword Spotting with Limited Training Data

Frequency & Channel Attention Network for Small Footprint Noisy Spoken Keyword Spotting

A Few-Shot Speech Keyword Spotting Method Based on Self-Supervise Learning.

Integration of Multi-Look Beamformers for Multi-Channel Keyword Spotting.

Effective Combination of DenseNet and BiLSTM for Keyword Spotting.

AAD-KWS: A Sub-μW Keyword Spotting Chip with an Acoustic Activity Detector Embedded in MFCC and a Tunable Detection Window in 28-nm CMOS

AAD-KWS: A Sub-μW Keyword Spotting Chip with a Zero-Cost, Acoustic Activity Detector from a 170nW MFCC Feature Extractor in 28nm CMOS