Abstract: Multi-channel speech enhancement aims to extract the desired speech using a microphone array and has many potential applications, such as video conferencing, automatic speech recognition, and hearing aids. Recently, deep learning-based spatial filters have achieved remarkable improvements over traditional beamformers, but the desired speech is often inferred directly from the noisy features without modeling the interference. In this work, a novel two-stage framework is proposed to extract the desired speech under the guidance of both the estimated interference and the desired signal. The resulting framework, called the Separation and Interaction Network (SI-Net), includes two components: the first module coarsely separates speech and interference, and the second sub-network serves as a post-processing module that simultaneously suppresses the residual noise and regenerates missing speech components under the guidance of the previously estimated speech and interference characteristics. Because both modules are differentiable, the proposed framework can be trained in an end-to-end manner. In addition, a causal spatial-temporal attention module is designed to model the inter-channel and inter-frame correlations simultaneously. Moreover, under this framework, we adopt channel shuffle and gated fusion strategies for the interaction between speech and interference components to convey knowledge about both “where to suppress and where to enhance”. Experiments conducted on a simulated multi-channel speech dataset demonstrate the superiority of the proposed framework over state-of-the-art baselines while still supporting real-time processing.
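The abstract names channel shuffle and gated fusion as the mechanisms through which the speech and interference branches interact, without giving their exact form. The following is a minimal NumPy sketch of one plausible reading, not the SI-Net implementation: the function names, the tensor layout `(channels, frames, freqs)`, the 1x1 gate projection `w_gate`/`b_gate`, and the convex blending rule are all assumptions made for illustration.

```python
import numpy as np


def channel_shuffle(x, groups):
    """Interleave channels across groups (ShuffleNet-style shuffle).

    x has the assumed layout (channels, frames, freqs).
    """
    c, t, f = x.shape
    if c % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    x = x.reshape(groups, c // groups, t, f)
    x = x.transpose(1, 0, 2, 3)  # swap group axis and within-group axis
    return x.reshape(c, t, f)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_fusion(speech, interference, w_gate, b_gate):
    """Blend two feature maps with a learned sigmoid gate (hypothetical form).

    speech, interference: (C, T, F) feature maps from the two branches.
    w_gate: (C, 2C) weights of an assumed 1x1 projection; b_gate: (C,) bias.
    """
    stacked = np.concatenate([speech, interference], axis=0)  # (2C, T, F)
    mixed = channel_shuffle(stacked, groups=2)  # alternate speech/interference
    # A 1x1 convolution over the channel axis is a per-bin matrix product.
    logits = np.einsum("oc,ctf->otf", w_gate, mixed) + b_gate[:, None, None]
    gate = sigmoid(logits)
    # Assumed blending rule: the gate decides, per time-frequency bin,
    # how much of each branch passes through.
    return gate * speech + (1.0 - gate) * interference
```

With two groups, the shuffle alternates speech and interference channels, so the gate projection sees both branches side by side. Because every operation acts on each frame independently, this sketch uses no future frames and is compatible with causal processing.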

A Lightweight Hybrid Multi-Channel Speech Extraction System with Directional Voice Activity Detection

A Speech Enhancement System for Automotive Speech Recognition with a Hybrid Voice Activity Detection Method

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Deep neural network-based generalized sidelobe canceller for dual-channel far-field speech recognition

Time-Domain Speech Extraction with Spatial Information and Multi Speaker Conditioning Mechanism

Single Channel Speech Enhancement Algorithm Based on BLSTM-DNN Bidirectional Optimized Hybrid Model

Speech enhancement aided end-to-end multi-task learning for voice activity detection

DualSep: A Light-weight dual-encoder convolutional recurrent network for real-time in-car speech separation

A Separation and Interaction Framework for Causal Multi-Channel Speech Enhancement

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

A Light Weight Model for Active Speaker Detection

Multi-channel Multi-frame ADL-MVDR for Target Speech Separation

VoAD: A Sub-μW Multiscene Voice Activity Detector Deploying Analog-Frontend Digital-Backend Circuits

A Speaker-Dependent Approach to Separation of Far-Field Multi-Talker Microphone Array Speech for Front-End Processing in the CHiME-5 Challenge

Progressive Multi-Target Network Based Speech Enhancement with SNR-Preselection for Robust Speaker Diarization

Directional Sound-Capture System with Acoustic Array Based on FPGA

A Hybrid Deep-Learning Approach for Single Channel HF-SSB Speech Enhancement

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Multi-channel end-to-end neural network for speech enhancement, source localization, and voice activity detection

A Unified DNN Approach to Speaker-Dependent Simultaneous Speech Enhancement and Speech Separation in Low SNR Environments

Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speech Enhancement