Abstract: Multi-channel speech enhancement aims to extract the desired speech using a microphone array and has many potential applications, such as video conferencing, automatic speech recognition, and hearing aids. Recently, deep learning-based spatial filters have achieved remarkable improvements over traditional beamformers, but the desired speech is often inferred directly from the noisy features without modeling the interference. In this work, a novel two-stage framework is proposed to extract the desired speech under the guidance of both the estimated interference and the estimated desired signal. The resulting framework, called the Separation and Interaction Network (SI-Net), includes two components: the first module coarsely separates speech and interference, and the second sub-network serves as a post-processing module that simultaneously suppresses the residual noise and regenerates missing speech components under the guidance of the previously estimated speech and interference characteristics. Because both modules are differentiable, the proposed framework can be trained in an end-to-end manner. In addition, a causal spatial-temporal attention module is designed to model the inter-channel and inter-frame correlations simultaneously. Moreover, under this framework, we adopt channel shuffle and gated fusion strategies for the interaction between speech and interference components to deliver knowledge about both "where to suppress and where to enhance". Experiments conducted on a simulated multi-channel speech dataset illustrate the superiority of the proposed framework over state-of-the-art baselines, while still supporting real-time processing.
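The abstract names channel shuffle and gated fusion as the interaction strategies between the speech and interference branches but does not give their exact form. The following is a minimal sketch under stated assumptions, not the authors' implementation: the shuffle follows the common ShuffleNet-style group interleaving, the gate is a sigmoid computed from both feature streams, and the function names, the weight `w_gate`, the bias `b_gate`, and the `(channels, frames)` feature layout are all hypothetical.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels across groups (ShuffleNet-style; an assumption here).

    x has shape (channels, frames); channels must be divisible by groups.
    """
    c, t = x.shape
    assert c % groups == 0, "channels must be divisible by groups"
    # Split into groups, swap the group and per-group axes, then flatten back.
    return x.reshape(groups, c // groups, t).transpose(1, 0, 2).reshape(c, t)

def gated_fusion(speech, interference, w_gate, b_gate):
    """Blend two feature maps with a sigmoid gate (hypothetical formulation).

    speech, interference: (C, T); w_gate: (C, 2C); b_gate: (C, 1).
    """
    stacked = np.concatenate([speech, interference], axis=0)   # (2C, T)
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ stacked + b_gate)))  # (C, T), values in (0, 1)
    # A gate near 1 keeps the speech feature; near 0 keeps the interference feature.
    return gate * speech + (1.0 - gate) * interference
```

In this sketch the gate plays the role of a soft per-bin mask, which is one plausible way to encode "where to suppress and where to enhance"; the actual SI-Net layers may differ.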

CUSIDE-array: A Streaming Multi-Channel End-to-End Speech Recognition System with Realistic Evaluations

CUSIDE-T: Chunking, Simulating Future and Decoding for Transducer based Streaming ASR

CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASR

Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition

A CIF-Based Speech Segmentation Method for Streaming E2E ASR

Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model

Automatic channel selection and spatial feature integration for multi-channel speech recognition across various array topologies

Exploiting Single-Channel Speech for Multi-Channel End-to-End Speech Recognition: A Comparative Study

End-to-end Code-switched TTS with Mix of Monolingual Recordings.

A Streaming End-to-End Framework for Spoken Language Understanding.

End-to-End Dereverberation, Beamforming, and Speech Recognition in a Cocktail Party.

Channel selection using neural network posterior probability for speech recognition with distributed microphone arrays in everyday environments

A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model

Streaming Audio-Visual Speech Recognition with Alignment Regularization

Real-time End-to-End Monaural Multi-speaker Speech Recognition

Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

SSCFormer: Push the Limit of Chunk-Wise Conformer for Streaming ASR Using Sequentially Sampled Chunks and Chunked Causal Convolution

A Separation and Interaction Framework for Causal Multi-Channel Speech Enhancement.

Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition

End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend