Abstract: As automatic speech recognition evolves, deployment of the voice user interface (VUI) has expanded rapidly. Especially since the COVID-19 pandemic, the VUI has gained more attention in online communication owing to its contactless nature. However, the VUI is difficult to apply in public settings because ambient noise degrades the received audio signals. In this article, we propose Wavoice, the first noise-resistant multi-modal speech recognition system that fuses two distinct voice-sensing modalities (i.e., millimeter-wave signals and audio signals from a microphone). One key contribution is a model of the inherent correlation between millimeter-wave and audio signals. Based on this model, Wavoice enables real-time, noise-resistant voice activity detection and targeting of the user among multiple speakers. Additionally, we present two novel modules for multi-modal fusion, embedded in the neural network, that lead to accurate speech recognition. Extensive experiments demonstrate the effectiveness of Wavoice under adverse conditions: the character error rate stays below 1% within a range of 7 m. Wavoice is considerably more robust and accurate than existing audio-only speech recognition methods, achieving lower character error and word error rates.

Fine-Tuning Wav2Vec2 for Speaker Recognition

Robust Speaker Recognition with Transformers Using wav2vec 2.0

Wav2sv: End-to-end Speaker Embeddings Learning from Raw Waveforms Based on Metric Learning for Speaker Verification

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Weighted Cluster-Range Loss and Criticality-Enhancement Loss for Speaker Recognition

Self-attention Based Speaker Recognition Using Cluster-Range Loss

A Fine-tuned Wav2vec 2.0/HuBERT Benchmark For Speech Emotion Recognition, Speaker Verification and Spoken Language Understanding

Speaker Adaptation for End-To-End Speech Recognition Systems in Noisy Environments

Wavoice: A mmWave-assisted Noise-resistant Speech Recognition System (Just Accepted)

Applying Wav2vec2.0 to Speech Recognition in Various Low-resource Languages

vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings

Online Speaker Adaptation for WaveNet-based Neural Vocoders

A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement

Y-Vector: Multiscale Waveform Encoder for Speaker Embedding

ERes2NetV2: Boosting Short-Duration Speaker Verification Performance with Computational Efficiency

Speaker recognition with two-step multi-modal deep cleansing

Phonetic-aware speaker embedding for far-field speaker verification