RelUNet: Relative Channel Fusion U-Net for Multichannel Speech Enhancement

Ibrahim Aldarmaki, Thamar Solorio, Bhiksha Raj, Hanan Aldarmaki
2024-10-07
Abstract: Neural multi-channel speech enhancement models, in particular those based on the U-Net architecture, demonstrate promising performance and generalization potential. These models typically encode input channels independently, and integrate the channels during later stages of the network. In this paper, we propose a novel modification of these models by incorporating relative information from the outset, where each channel is processed in conjunction with a reference channel through stacking. This input strategy exploits comparative differences to adaptively fuse information between channels, thereby capturing crucial spatial information and enhancing the overall performance. The experiments conducted on the CHiME-3 dataset demonstrate improvements in speech enhancement metrics across various architectures.
Sound, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses a central problem in multi-channel speech enhancement: how to separate and enhance the target speech more effectively in noisy and reverberant environments. Existing methods usually integrate multi-channel information at a later stage. The paper proposes RelUNet (Relative Channel Fusion U-Net), a variant of the U-Net architecture that incorporates relative information from the very beginning, so that cross-channel information is fused earlier and speech enhancement improves.

### Specific problem description

1. **Challenges in multi-channel speech enhancement**:
   - In complex environments (such as those with multiple sound sources and reflections), single-channel speech enhancement methods struggle to separate the target speech.
   - Multi-channel microphone arrays provide additional spatial information, which helps separate the target speech more accurately, but existing methods usually integrate this information at a later stage.
2. **Limitations of existing methods**:
   - Traditional signal processing methods (such as beamforming) rely on accurate spatial information (such as time differences and noise covariance matrices), which is difficult to estimate accurately in noisy and reverberant conditions.
   - Although deep learning methods can learn complex features from data, most models process each channel independently in the early stages and fail to fully exploit the relative information between channels.

### Solutions proposed in the paper

The paper proposes RelUNet, a modified model based on the U-Net architecture, which addresses the above problems in the following ways:

1. **Introduction of relative information**:
   - RelUNet uses relative information from the input stage onward by stacking each channel with a reference channel.
   - This strategy lets the network capture the comparative differences between channels at an early stage and thus fuse spatial information more effectively.
2. **Improved information fusion mechanism**:
   - Graph neural networks (GNNs), such as graph convolutional networks (GCN) and graph attention networks (GAT), are inserted between the encoder and the decoder to further strengthen the fusion of cross-channel information.
3. **Experimental verification**:
   - Experiments on the CHiME-3 dataset show that RelUNet improves speech enhancement performance in various noisy environments, with gains on evaluation metrics such as PESQ and STOI.

### Summary

This paper targets multi-channel speech enhancement by introducing relative information and an improved information fusion mechanism. The focus is on using the spatial information in multi-channel signals more effectively, thereby improving the quality and robustness of speech enhancement.
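The input strategy described above, in which each channel is stacked with a reference channel, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `stack_with_reference`, the `(channels, freq, time)` layout, the choice of channel 0 as the reference, and the toy array sizes are all assumptions made for illustration, and the summary does not say which features (for example magnitude or complex spectrograms) the paper actually stacks.

```python
import numpy as np


def stack_with_reference(x: np.ndarray, ref_idx: int = 0) -> np.ndarray:
    """Pair every channel with a reference channel (hypothetical helper).

    x: array of shape (channels, freq, time), e.g. per-microphone spectrograms.
    Returns an array of shape (channels, 2, freq, time), where
    out[c, 0] is channel c and out[c, 1] is the reference channel.
    """
    if x.ndim != 3:
        raise ValueError("expected shape (channels, freq, time)")
    num_channels = x.shape[0]
    if not 0 <= ref_idx < num_channels:
        raise ValueError("ref_idx out of range")
    # View the reference as if it were repeated once per channel.
    ref = np.broadcast_to(x[ref_idx], x.shape)
    # The new axis 1 holds the (channel, reference) pair.
    return np.stack([x, ref], axis=1)


# Toy input: 6 microphones, 4 frequency bins, 5 frames (sizes are arbitrary).
rng = np.random.default_rng(0)
mixture = rng.standard_normal((6, 4, 5))
pairs = stack_with_reference(mixture, ref_idx=0)
print(pairs.shape)  # (6, 2, 4, 5)
```

Under these assumptions, each `pairs[c]` is a two-plane input that an encoder could process per channel, so the comparison with the reference is available from the first layer rather than only at a later fusion stage.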