Abstract: Neural multi-channel speech enhancement models, in particular those based on the U-Net architecture, demonstrate promising performance and generalization potential. These models typically encode input channels independently, and integrate the channels during later stages of the network. In this paper, we propose a novel modification of these models by incorporating relative information from the outset, where each channel is processed in conjunction with a reference channel through stacking. This input strategy exploits comparative differences to adaptively fuse information between channels, thereby capturing crucial spatial information and enhancing the overall performance. The experiments conducted on the CHiME-3 dataset demonstrate improvements in speech enhancement metrics across various architectures.

What problem does this paper attempt to address?

This paper addresses a central problem in multi-channel speech enhancement: how to separate and enhance the target speech more effectively in noisy, reverberant environments. Existing models usually integrate multi-channel information at a late stage. The paper instead proposes RelUNet (Relative Channel Fusion U-Net), a variant of the U-Net architecture that incorporates relative information from the very beginning, so that cross-channel information is fused earlier and speech enhancement improves.

### Specific problem description

1. **Challenges in multi-channel speech enhancement**:
   - In complex environments (for example, with multiple sound sources and reflections), single-channel speech enhancement methods struggle to separate the target speech.
   - Multi-channel microphone arrays provide additional spatial information that helps separate the target speech more accurately, but existing methods usually integrate this information at a late stage.
2. **Limitations of existing methods**:
   - Traditional signal processing methods (such as beamforming) rely on accurate spatial information (such as time differences and noise covariance matrices), which is hard to estimate accurately in noisy and reverberant conditions.
   - Deep learning methods can learn complex features from data, but most models process each channel independently in the early stages and do not fully exploit the relative information between channels.

### Solutions proposed in the paper

The paper proposes RelUNet, a modified model based on the U-Net architecture, which addresses these problems in the following ways:

1. **Introduction of relative information**:
   - RelUNet uses relative information from the input stage onward by stacking each channel with a reference channel.
   - This strategy lets the network capture the comparative differences between channels at an early stage and thus fuse spatial information better.
2. **Improved information fusion mechanism**:
   - Graph neural networks (GNNs), such as graph convolutional networks (GCN) and graph attention networks (GAT), are inserted between the encoder and the decoder to further strengthen the fusion of cross-channel information.
3. **Experimental verification**:
   - Experiments on the CHiME-3 dataset show that RelUNet improves speech enhancement performance in various noisy environments, notably on evaluation metrics such as PESQ and STOI.

### Summary

By introducing relative information and an improved information fusion mechanism, the paper aims to use the spatial information in multi-channel signals more effectively, thereby improving the quality and robustness of speech enhancement.
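The input strategy of stacking each channel with a reference channel can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `stack_with_reference`, the `ref_index` parameter, the nested-list layout, and the choice to pair the reference channel with itself are all assumptions made for illustration. A real model would operate on spectrogram tensors in an array library.

```python
from typing import List, Sequence

Frame = List[float]    # features of one time frame for a single channel
Channel = List[Frame]  # one channel: a sequence of frames, shape (T, F)


def stack_with_reference(
    channels: Sequence[Channel], ref_index: int = 0
) -> List[List[Channel]]:
    """Pair every channel with the reference channel (hypothetical helper).

    Input:  C channels, each of shape (T, F).
    Output: C items, each of shape (2, T, F), ordered [channel, reference].
    """
    if not 0 <= ref_index < len(channels):
        raise IndexError("ref_index out of range")
    reference = channels[ref_index]
    # Assumption: the reference channel is also paired with itself,
    # so the output keeps one entry per input channel.
    return [[channel, reference] for channel in channels]


# Toy input: 3 microphones, 2 time frames, 2 features per frame.
mics = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.5, 2.5], [3.5, 4.5]],
    [[0.5, 1.5], [2.5, 3.5]],
]
stacked = stack_with_reference(mics, ref_index=0)
```

Each element of `stacked` can then be passed to a shared encoder as a two-channel input, so the encoder sees a channel and the reference side by side from the first layer instead of encoding each channel in isolation.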

RelUNet: Relative Channel Fusion U-Net for Multichannel Speech Enhancement

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

CAT-DUnet: Enhancing Speech Dereverberation via Feature Fusion and Structural Similarity Loss

A Nested U-Net with Efficient Channel Attention and D3Net for Speech Enhancement

Convolutional fusion network for monaural speech enhancement

Supervised Single Channel Speech Enhancement Method Using UNET

A time-frequency fusion model for multi-channel speech enhancement

Two-stage unet with channel and temporal-frequency attention for multi-channel speech enhancement

A Feature Integration Network for Multi-Channel Speech Enhancement

Multi-channel U-Net for Music Source Separation

A Multi-scale Subconvolutional U-Net with Time-Frequency Attention Mechanism for Single Channel Speech Enhancement

Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement

Multichannel Speech Enhancement without Beamforming

SpatialNet: Extensively Learning Spatial Information for Multichannel Joint Speech Separation, Denoising and Dereverberation

Single-Channel Speech Enhancement with Deep Complex U-Networks and Probabilistic Latent Space Models

U-Former: Improving Monaural Speech Enhancement with Multi-head Self and Cross Attention

A Subconvolutional U-net with Gated Recurrent Unit and Efficient Channel Attention Mechanism for Real-Time Speech Enhancement

FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement

Deep Complex U-Net with Conformer for Audio-Visual Speech Enhancement

Uformer: A Unet Based Dilated Complex & Real Dual-Path Conformer Network for Simultaneous Speech Enhancement and Dereverberation