Abstract: In recent years, significant advances have been made in neural beamforming, which leverages spectral and spatial cues to improve multi-channel speech enhancement. When frame-wise processing is required, existing all-neural beamformers face a trade-off between performance and algorithmic delay. Moreover, from the perspective of multi-source information fusion, the network is often treated as a black box that entangles and fuses the spatial and spectral features in a non-linear feature space, which hinders our understanding of how the two work together for target speech extraction. To address this, this paper proposes to decouple spatial and spectral domain processing, inspired by Taylor's approximation theory. Specifically, we reformulate time-variant beamforming, which is defined in the spatial domain, as the adaptive weighting and mixing of different beam components in the beamspace domain. This reformulation enables us to model the recovery of target speech as a weighted sum in the beamspace domain, where each beam component is associated with an introduced unknown term for residual interference cancellation. By virtue of Taylor's series expansion, the recovery process can be decomposed into the superposition of the 0th-order non-derivative term and the high-order derivative terms, where the former acts as spatial filtering in the spatial domain and the latter serve as a residual interference canceller in the spectral domain. We conduct extensive experiments on the spatialized LibriSpeech and L3DAS Challenge datasets. Experimental results show that, compared with existing advanced approaches, the proposed method not only achieves competitive performance on multiple objective metrics but also provides feasible guidance for the design of multi-channel speech enhancement pipelines.
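The beamspace reformulation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `to_beamspace` and `recover`, the tensor shapes, and the use of random placeholder values for the fixed beam weights, the time-variant mixing weights, and the residual terms are all assumptions made for illustration (in the paper these quantities would be estimated by the network). The sketch only shows that a time-variant weighted sum of fixed beams is equivalent to a time-variant beamformer applied in the spatial domain, with the higher-order terms entering as additive corrections.

```python
import numpy as np

def to_beamspace(X, W):
    """Project a multi-channel spectrum onto a set of fixed beams.

    X: (M, T, F) complex STFT from M microphones.
    W: (B, M, F) complex fixed beamformer weights (hypothetical values).
    Returns (B, T, F) beam components, beams[b,t,f] = sum_m conj(W[b,m,f]) * X[m,t,f].
    """
    return np.einsum("bmf,mtf->btf", W.conj(), X)

def recover(beams, G, residuals=()):
    """Weighted sum in the beamspace domain plus additive residual terms.

    beams: (B, T, F) beam components.
    G: (B, T, F) time-variant mixing weights (assumed given here).
    residuals: iterable of (T, F) arrays standing in for the high-order terms.
    """
    zeroth = np.sum(G * beams, axis=0)  # 0th-order term: adaptive mixing of beams
    return zeroth + sum(residuals)      # high-order terms act as additive corrections

# Toy dimensions: microphones, frames, frequency bins, beams (all assumed).
M, T, F, B = 4, 5, 3, 2
rng = np.random.default_rng(0)
X = rng.standard_normal((M, T, F)) + 1j * rng.standard_normal((M, T, F))
W = rng.standard_normal((B, M, F)) + 1j * rng.standard_normal((B, M, F))
G = rng.standard_normal((B, T, F)) + 1j * rng.standard_normal((B, T, F))

beams = to_beamspace(X, W)
beamspace_out = recover(beams, G)

# Equivalent time-variant beamformer in the spatial domain:
# w_eff[m,t,f] = sum_b conj(G[b,t,f]) * W[b,m,f]
w_eff = np.einsum("btf,bmf->mtf", G.conj(), W)
spatial_out = np.sum(w_eff.conj() * X, axis=0)
```

Here `beamspace_out` and `spatial_out` agree up to floating-point error, which is the sense in which mixing fixed beams with time-variant weights reproduces time-variant beamforming.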

Masking-based Neural Beamformer for Multichannel Speech Enhancement

Deep Learning Based Speech Beamforming

A New Neural Beamformer for Multi-channel Speech Separation

Neural Spatio-Temporal Beamformer for Target Speech Separation

Attention-Based Beamformer For Multi-Channel Speech Enhancement

Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition

A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model

Two-stage unet with channel and temporal-frequency attention for multi-channel speech enhancement

Subspace Hybrid MVDR Beamforming for Augmented Hearing

Subspace Hybrid Beamforming for Head-worn Microphone Arrays

Array Geometry-Robust Attention-Based Neural Beamformer for Moving Speakers

A Corpus-Based Evaluation of Beamforming Techniques and Phase-Based Frequency Masking

A High-Resolution and Low-Frequency Acoustic Beamforming Based on Bayesian Inference and Non-Synchronous Measurements

A Dual-Channel Beamformer Based on Time-Delay Compensation Estimator and Shifted PCA for Speech Enhancement

Multichannel Speech Enhancement without Beamforming

TaBE: Decoupling spatial and spectral processing with Taylor's unfolding method in the beamspace domain for multi-channel speech enhancement

Benefits of triple acoustic beamforming during speech-on-speech masking and sound localization for bilateral cochlear-implant users

Unsupervised Speech Enhancement Based on Multichannel NMF-Informed Beamforming for Noise-Robust Automatic Speech Recognition

Multi-channel end-to-end neural network for speech enhancement, source localization, and voice activity detection

A Speech Enhancement System Based on Real-time Sound Source Localization and Super-directional Fixed Beamforming

Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition