Abstract: With advances in automatic speech recognition, voice user interfaces (VUIs) have gained popularity. Since the COVID-19 pandemic, VUIs have been increasingly preferred in online communication because they are contactless. However, ambient noise impedes the public deployment of voice user interfaces, because audio-only speech recognition methods require a high signal-to-noise ratio. In this paper, we present Wavoice, the first noise-resistant multi-modal speech recognition system that fuses two distinct voice sensing modalities: millimeter-wave (mmWave) signals and audio signals from a microphone. One key contribution is a model of the inherent correlation between mmWave and audio signals. Based on this model, Wavoice performs real-time, noise-resistant voice activity detection and targets a user among multiple speakers. Furthermore, we introduce two novel modules into the neural attention mechanism to fuse the multi-modal signals, resulting in accurate speech recognition. Extensive experiments verify Wavoice's effectiveness under various conditions, with a character error rate below 1% within a range of 7 meters. Wavoice outperforms existing audio-only speech recognition methods, achieving lower character error rate and word error rate. Evaluation in complex scenes validates the robustness of Wavoice.
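The abstract reports results as character error rate (CER) and word error rate (WER). As a minimal sketch of how these metrics are conventionally computed (edit distance between the reference and the recognized text, divided by the reference length), the following may help. It is not taken from the Wavoice paper; the function names `edit_distance`, `cer`, and `wer` and the example sentences are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical transcript pair, not data from the paper.
    reference = "turn on the light"
    hypothesis = "turn on the night"
    print(f"CER = {cer(reference, hypothesis):.3f}")
    print(f"WER = {wer(reference, hypothesis):.3f}")
```

Under this definition, a single wrong character in a short utterance changes WER far more than CER, which is why the two figures are usually reported together.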

Robust speech recognition in noisy backgrounds based on Teager energy operator and auditory process

Wavoice: A mmWave-assisted Noise-resistant Speech Recognition System (Just Accepted)

A Noise Robust Front End Algorithm for Mandarin Speech Recognition and Performance Analysis

Wavoice: An mmWave-Assisted Noise-Resistant Speech Recognition System

Entropy of Energy Operator As Feature for Large Vocabulary Mandarin Speaker Independent Speech Recognition

Robust Speech Recognition by Selecting Mel-Filter Banks

Robust Log-Energy Estimation and Its Dynamic Change Enhancement for In-car Speech Recognition

Two-stage Framework for Robust Speech Emotion Recognition Using Target Speaker Extraction in Human Speech Noise Conditions

Robust Front-End for Speech Recognition Based on Computational Auditory Scene Analysis and Speaker Model

Speech and Noise Dual-Stream Spectrogram Refine Network with Speech Distortion Loss for Robust Speech Recognition

Modeling of Teager Energy Operated Perceptual Wavelet Packet Coefficients with an Erlang-2 PDF for Real Time Enhancement of Noisy Speech

High Performance Digit Mandarin Speech Recognition

Weighting Observation Vectors for Robust Speech Recognition in Noisy Environments

VTS-based Robust Speech Recognition

Wavoice: A Noise-resistant Multi-modal Speech Recognition System Fusing mmWave and Audio Signals

Energy-efficient MFCC Extraction Architecture in Mixed-Signal Domain for Automatic Speech Recognition

Speech Enhancement Based on Minimum Band Energy in Variable Noise-level Environments

A Speech Enhancement Algorithm Based on Computational Auditory Scene Analysis

Robust Speech Recognition With Speech Enhanced Deep Neural Networks

Robust Speech Detection with Heteroscedastic Discriminant Analysis Applied to the Time-frequency Energy