Abstract: Keyword spotting remains a challenge in real-world environments where noise changes dramatically. Recent studies show that audio-visual integration methods have an advantage because visual speech is not affected by acoustic noise. In visual speech recognition, however, individual utterance mannerisms can cause confusion and false recognition. To address this problem, this paper presents a novel lip descriptor that combines geometry-based and appearance-based features. Specifically, a set of geometry-based features is proposed based on an advanced facial landmark localization method. To obtain a robust and discriminative representation, a spatiotemporal lip feature is proposed that considers similarities among textons and maps the feature to an intra-class subspace. Moreover, a parallel two-step keyword spotting strategy based on decision fusion is proposed to make the best use of audio-visual speech and adapt to diverse noise conditions. Weights generated by a neural network combine the acoustic and visual contributions. Experimental results on the OuluVS and PKU-AV datasets demonstrate that the proposed lip descriptor performs competitively with the state of the art. Additionally, the proposed audio-visual keyword spotting (AV-KWS) method based on decision-level fusion significantly improves noise robustness, attains better performance than feature-level fusion, and adapts to various noisy conditions.
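The decision-level fusion described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract states only that a neural network generates the fusion weights, so the function `audio_weight_from_snr` below is a hypothetical stand-in (a simple sigmoid of an assumed signal-to-noise estimate), and the names `fuse_scores`, `spot_keyword`, `midpoint_db`, `slope`, and `threshold` are likewise assumptions made for illustration. Scores are assumed to be per-modality keyword confidences in [0, 1].

```python
import math


def fuse_scores(audio_score, visual_score, audio_weight):
    """Combine per-modality keyword confidences with a convex weight.

    audio_weight is the share given to the acoustic decision; the visual
    decision receives the remainder (1 - audio_weight).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_score + (1.0 - audio_weight) * visual_score


def audio_weight_from_snr(snr_db, midpoint_db=10.0, slope=0.3):
    """Hypothetical stand-in for the weight-generating neural network.

    Assumption: the acoustic stream deserves less weight as noise grows,
    modeled here as a sigmoid of an estimated SNR in dB.
    """
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint_db)))


def spot_keyword(audio_score, visual_score, snr_db, threshold=0.5):
    """Return (detected, fused_score) for one keyword hypothesis."""
    weight = audio_weight_from_snr(snr_db)
    fused = fuse_scores(audio_score, visual_score, weight)
    return fused >= threshold, fused


if __name__ == "__main__":
    # Noisy audio: the visual decision dominates the fused score.
    print(spot_keyword(audio_score=0.2, visual_score=0.9, snr_db=-5.0))
    # Clean audio: the acoustic decision dominates the fused score.
    print(spot_keyword(audio_score=0.9, visual_score=0.1, snr_db=30.0))
```

Because fusion happens on the decisions rather than on concatenated features, each modality keeps its own recognizer and only the weight needs to change as noise conditions vary, which is one plausible reading of why the abstract reports better adaptability than feature-level fusion.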

Audio-visual Keyword Spotting Based on Adaptive Decision Fusion under Noisy Conditions for Human-Robot Interaction.

Audio-visual Keyword Spotting for Mandarin Based on Discriminative Local Spatial-Temporal Descriptors.

A Novel Lip Descriptor for Audio-Visual Keyword Spotting Based on Adaptive Decision Fusion

Audio-Visual Multi-person Keyword Spotting Via Hybrid Fusion

Audio-Visual Keyword Spotting Based on Multidimensional Convolutional Neural Network

Specialty may be better: A decoupling multi-modal fusion network for Audio-visual event localization

Seeing wake words: Audio-visual Keyword Spotting

Audio–visual Keyword Transformer for Unconstrained Sentence‐level Keyword Spotting

Keyword Spotting for Hearing Assistive Devices Robust to External Speakers

Robust Dual-Modal Speech Keyword Spotting for XR Headsets

Bridging the Gap between Audio and Text using Parallel-attention for User-defined Keyword Spotting

Robust Audio-Visual Speech Recognition Based on Hybrid Fusion

Robust Wake Word Spotting With Frame-Level Cross-Modal Attention Based Audio-Visual Conformer

On-device audio-visual multi-person wake word spotting

Keyword Spotting Based on Hypothesis Boundary Realignment and State-Level Confidence Weighting

U2-KWS: Unified Two-pass Open-vocabulary Keyword Spotting with Keyword Bias

Joint Decoding of Tandem and Hybrid Systems for Improved Keyword Spotting on Low Resource Languages

A perceptual manipulation system for audio-visual fusion of robots

Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting