Abstract: Although keyword spotting (KWS) technologies have been applied successfully in some applications, most KWS systems lack noise robustness in real-world environments. Audio-visual keyword spotting (AVKWS), which uses both acoustic and visual information, addresses this problem by letting the two modalities complement each other. Most existing audio-visual speech recognition (AVSR) systems use geometric visual features, which rely heavily on accurate and reliable detection and tracking of facial feature points. To avoid this weakness of geometric features, this paper proposes an appearance-based discriminative local spatial-temporal descriptor (disCLBP-TOP) designed to extract robust and discriminative patterns of interest. In addition, a parallel two-step recognition scheme based on acoustic and visual keyword searching and re-scoring is used, so that the two modalities complement each other under different noise conditions. Adaptive weights for decision fusion are generated by a sigmoid function of the reliabilities of the two modalities, which allows the fusion to adapt to various noise conditions. Experiments show that the proposed parallel AVKWS strategy based on decision fusion significantly improves noise robustness and outperforms a feature-fusion-based audio-visual spotter. Additionally, disCLBP-TOP is more competitive than CLBP-TOP.
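The adaptive decision fusion mentioned in the abstract can be illustrated with a minimal sketch. The abstract states only that the weights come from a sigmoid function of the two modalities' reliabilities; the exact reliability measure, the use of the reliability difference as the sigmoid input, and the names `audio_weight`, `fuse_scores`, `slope`, and `offset` are assumptions made here for illustration, not the paper's formulation.

```python
import math


def audio_weight(audio_rel, visual_rel, slope=1.0, offset=0.0):
    """Map the reliability gap to an audio weight in [0, 1] with a sigmoid.

    Assumption: the sigmoid input is the difference between the two
    reliabilities, scaled by a hypothetical `slope` and shifted by `offset`.
    """
    x = slope * (audio_rel - visual_rel) + offset
    # Numerically stable sigmoid: avoid exp() of a large positive value.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fuse_scores(audio_score, visual_score, audio_rel, visual_rel,
                slope=1.0, offset=0.0):
    """Decision fusion: weighted sum of per-modality keyword scores."""
    w = audio_weight(audio_rel, visual_rel, slope, offset)
    return w * audio_score + (1.0 - w) * visual_score


if __name__ == "__main__":
    # Hypothetical scores and reliabilities for one keyword candidate.
    # When the audio reliability is low (noisy speech), the fused score
    # leans toward the visual score.
    print(fuse_scores(audio_score=-12.0, visual_score=-8.0,
                      audio_rel=0.2, visual_rel=1.5))
```

With equal reliabilities the two modalities are weighted equally; as one modality becomes less reliable, its contribution shrinks smoothly rather than being switched off, which is the property that lets a fusion rule of this kind adapt across noise conditions.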

Keyword-Specific Acoustic Model Pruning for Open-Vocabulary Keyword Spotting

U2-KWS: Unified Two-pass Open-vocabulary Keyword Spotting with Keyword Bias

Weight-importance sparse training in keyword spotting

Keyword-specific normalization based keyword spotting for spontaneous speech

DCCRN-KWS: an audio bias based model for noise robust small-footprint keyword spotting

A Multitask Training Approach to Enhance Whisper with Contextual Biasing and Open-Vocabulary Keyword Spotting

A New Keyword Spotting Approach for Spontaneous Mandarin Speech

CaTT-KWS: A Multi-stage Customized Keyword Spotting Framework Based on Cascaded Transducer-Transformer.

A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning

Web-based keyword adapted Language Modeling for Keyword Spotting

Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments

Exploring Representation Learning for Small-Footprint Keyword Spotting

NS-KWS: joint optimization of near-sensor processing architecture and low-precision GRU for always-on keyword spotting

Open-vocabulary Keyword-spotting with Adaptive Instance Normalization

Subword scheme for keyword search

Keyword Spotting for Hearing Assistive Devices Robust to External Speakers

Utilizing TTS Synthesized Data for Efficient Development of Keyword Spotting Model

MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting

Keyword Spotting Based on Phoneme Confusion Matrix

TDT-KWS: Fast And Accurate Keyword Spotting Using Token-and-duration Transducer

Audio-visual Keyword Spotting for Mandarin Based on Discriminative Local Spatial-Temporal Descriptors.