Abstract: Keyword spotting remains challenging in real-world environments where noise changes dramatically. Recent studies show that audio-visual integration methods have an advantage, since visual speech is not affected by acoustic noise. In visual speech recognition, however, individual utterance mannerisms can cause confusion and false recognition. To address this problem, this paper presents a novel lip descriptor that combines geometry-based and appearance-based features. Specifically, a set of geometry-based features is proposed based on an advanced facial landmark localization method. To obtain a robust and discriminative representation, a spatiotemporal lip feature is proposed that accounts for similarities among textons and maps the feature to an intra-class subspace. Moreover, a parallel two-step keyword spotting strategy based on decision fusion is proposed to make the best use of audio-visual speech and adapt to diverse noise conditions. Weights generated by a neural network combine the acoustic and visual contributions. Experimental results on the OuluVS and PKU-AV datasets demonstrate that the proposed lip descriptor performs competitively with the state of the art. Additionally, the proposed audio-visual keyword spotting (AV-KWS) method based on decision-level fusion significantly improves noise robustness, outperforms feature-level fusion, and adapts to various noisy conditions.
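
The decision-level fusion described in the abstract can be illustrated with a minimal sketch. It assumes each modality has already produced a keyword confidence score in [0, 1] and that a single audio weight is available. In the paper the weights are generated by a neural network; here the weight is simply passed in as a number. The function names, the convex-combination form, and the 0.5 threshold are assumptions made for illustration and are not taken from the paper.

```python
def fuse_scores(audio_score, visual_score, audio_weight):
    """Combine per-modality keyword scores with a convex weighting.

    audio_weight is assumed to lie in [0, 1]; the visual stream
    receives the remaining weight (1 - audio_weight).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_score + (1.0 - audio_weight) * visual_score


def spot_keyword(audio_score, visual_score, audio_weight, threshold=0.5):
    """Accept the keyword when the fused score reaches the threshold.

    The threshold value is a hypothetical default, not a value from the paper.
    """
    fused = fuse_scores(audio_score, visual_score, audio_weight)
    return fused >= threshold
```

Under heavy acoustic noise, a small audio weight lets a confident visual score carry the decision, while a weight near 1 relies almost entirely on the audio stream.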

A multimodel keyword spotting system based on lip movement and speech features

A Novel Lip Descriptor for Audio-Visual Keyword Spotting Based on Adaptive Decision Fusion

Multi-Grained Spatio-temporal Modeling for Lip-reading

Sub-word Level Lip Reading With Visual Attention

Seeing wake words: Audio-visual Keyword Spotting

Lip Movement Detection Using 3D Convolution and Resnet

Deep Audio-visual Speech Recognition

Robust Dual-Modal Speech Keyword Spotting for XR Headsets

LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers

Learning Speaker-specific Lip-to-Speech Generation

Efficient DNN Model for Word Lip-Reading

Resource aware design of a deep convolutional-recurrent neural network for speech recognition through audio-visual sensor fusion

Learn an Effective Lip Reading Model without Pains

Spatio-Temporal Attention Mechanism and Knowledge Distillation for Lip Reading

Mmmic: Multi-modal Speech Recognition Based on Mmwave Radar.

Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models

Audio-Visual Keyword Spotting Based on Multidimensional Convolutional Neural Network

Intuitive Perception - Speech Recognition using Machine Learning

Audio-visual Keyword Spotting Based on Adaptive Decision Fusion under Noisy Conditions for Human-Robot Interaction.

Cross-Modal Language Modeling in Multi-Motion-Informed Context for Lip Reading