Abstract: Keyword spotting remains challenging in real-world environments where noise changes dramatically. Recent studies show that audio-visual integration helps, because visual speech is not affected by acoustic noise. In visual speech recognition, however, individual speaking mannerisms can cause confusion and false recognition. To address this problem, this paper presents a novel lip descriptor that combines geometry-based and appearance-based features. Specifically, a set of geometry-based features is built on an advanced facial landmark localization method. To obtain a robust and discriminative representation, a spatiotemporal lip feature is proposed that accounts for similarities among textons and maps the feature to an intra-class subspace. Moreover, a parallel two-step keyword spotting strategy based on decision fusion is proposed to make the best use of audio-visual speech and to adapt to diverse noise conditions. Weights generated by a neural network combine the acoustic and visual contributions. Experimental results on the OuluVS and PKU-AV datasets show that the proposed lip descriptor performs competitively with the state of the art. Additionally, the proposed audio-visual keyword spotting (AV-KWS) method based on decision-level fusion significantly improves noise robustness, outperforms feature-level fusion, and adapts to various noise conditions.
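
The decision-level fusion mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the score range of 0 to 1, the detection threshold, and the hand-supplied weight are all assumptions made for illustration. In the paper the weight is generated by a neural network, which is not modeled here.

```python
def fuse_scores(audio_score: float, visual_score: float, audio_weight: float) -> float:
    """Combine per-modality keyword confidences with a convex weight.

    audio_weight is a hypothetical stand-in for the network-generated weight;
    here it is simply passed in by the caller.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    # Decision-level fusion: each modality is scored separately, then mixed.
    return audio_weight * audio_score + (1.0 - audio_weight) * visual_score


def detect_keyword(audio_score: float, visual_score: float,
                   audio_weight: float, threshold: float = 0.5) -> bool:
    """Accept the keyword when the fused confidence reaches the threshold.

    The threshold value is an assumption chosen only for this example.
    """
    return fuse_scores(audio_score, visual_score, audio_weight) >= threshold


if __name__ == "__main__":
    # Noisy audio: a low audio weight lets the visual stream dominate.
    print(detect_keyword(audio_score=0.2, visual_score=0.9, audio_weight=0.25))
```

Under heavy acoustic noise a small audio weight shifts the decision toward the visual stream, while in clean conditions a larger weight would favor the audio stream.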

Personalizing Keyword Spotting with Speaker Information

Audio-visual Keyword Spotting Based on Adaptive Decision Fusion under Noisy Conditions for Human-Robot Interaction

Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding

A Novel Lip Descriptor for Audio-Visual Keyword Spotting Based on Adaptive Decision Fusion

Speech Personality Recognition Based on Annotation Classification Using Log-Likelihood Distance and Extraction of Essential Audio Features

Keyword Spotting for Hearing Assistive Devices Robust to External Speakers

Personalized Speech Recognizer With Keyword-Based Personalized Lexicon And Language Model Using Word Vector Representations

Audio-Visual Multi-person Keyword Spotting Via Hybrid Fusion

Lexical Access-Based Confidence Measure for a Spanish Keyword Spotting System

State-of-the-art in speaker recognition

Audio-visual Keyword Spotting for Mandarin Based on Discriminative Local Spatial-Temporal Descriptors

Enhancing multilingual speech recognition in air traffic control by sentence-level language identification

Short Utterance Speaker Recognition Based on Speech High Frequency Information Compensation and Dynamic Feature Enhancement Methods

Introducing Multilingual Phonetic Information to Speaker Embedding for Speaker Verification

Bridging the Gap Between Audio and Text Using Parallel-Attention for User-Defined Keyword Spotting

Keyword-specific normalization based keyword spotting for spontaneous speech

Query-by-Example Keyword Spotting Using Spectral-Temporal Graph Attentive Pooling and Multi-Task Learning

Focal Loss And Double-Edge-Triggered Detector For Robust Small-Footprint Keyword Spotting

SLiCK: Exploiting Subsequences for Length-Constrained Keyword Spotting

Robust Dual-Modal Speech Keyword Spotting for XR Headsets