Full Attention Bidirectional Deep Learning Structure for Single Channel Speech Enhancement

Yuzi Yan, Wei-Qiang Zhang, Michael T. Johnson
DOI: https://doi.org/10.48550/arXiv.2108.12105
2021-08-27
Abstract: As the cornerstone of other important technologies, such as speech recognition and speech synthesis, speech enhancement is a critical area in audio signal processing. In this paper, a new deep learning structure for speech enhancement is demonstrated. The model introduces a "full" attention mechanism to a bidirectional sequence-to-sequence method to make use of latent information after each focal frame. This is an extension of the previous attention-based RNN method. The proposed bidirectional attention-based architecture achieves better performance in terms of speech quality (PESQ), compared with OM-LSA, CNN-LSTM, T-GSA and the unidirectional attention-based LSTM baseline.
Sound, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
This paper aims to improve speech quality in single-channel speech enhancement. The authors propose a new deep learning structure that introduces a "full" attention mechanism into a bidirectional sequence-to-sequence method, so that the model can use the latent information that follows each focal frame as well as the information that precedes it. The method extends previous attention-based Recurrent Neural Network (RNN) methods and targets the main limitation of unidirectional attention, which sees only past frames. The goal is better speech quality as measured by metrics such as PESQ, especially under irregular noise and at low Signal-to-Noise Ratio (SNR).

The main contributions of the paper are as follows:

- **Bidirectional full-attention mechanism**: It combines forward and backward information. The full-attention mechanism draws on the context on both sides of the focal frame, which improves the model's ability to capture which frames are relevant to it.
- **Mel-frequency features**: It represents the audio sequence with Mel-frequency Filter Bank (FBank) features, which reduces the number of weights to be estimated and simplifies the model.
- **Experimental verification**: Extensive experiments on two public speech databases, THCHS-30 and QUT-NOISE-TIMIT, show that the proposed bidirectional full-attention model outperforms the other baseline methods in most cases, with the largest gains under irregular noise.

Through these innovations, the research offers new ideas and technical support for the development of speech enhancement technology.
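The difference between unidirectional and "full" attention can be illustrated with a minimal sketch. It is not the authors' model: the summary does not specify the scoring function or the network, so the code assumes plain scaled dot-product scores over per-frame state vectors. The function name `attention_weights` and the random `states` are hypothetical stand-ins.

```python
import math
import random


def attention_weights(states, causal):
    """Return a T x T matrix; row t holds the weights for focal frame t.

    Assumption: scaled dot-product scoring. The paper's actual scoring
    function is not given in this summary.
    """
    num_frames = len(states)
    dim = len(states[0])
    weights = []
    for t in range(num_frames):
        # Unidirectional attention stops at the focal frame;
        # full attention also covers the frames after it.
        last = t if causal else num_frames - 1
        scores = [
            sum(a * b for a, b in zip(states[t], states[s])) / math.sqrt(dim)
            for s in range(last + 1)
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (num_frames - 1 - last)
        weights.append(row)
    return weights


random.seed(0)
# Hypothetical per-frame states: 5 frames, 4 dimensions each.
states = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]

w_causal = attention_weights(states, causal=True)
w_full = attention_weights(states, causal=False)
```

In `w_causal`, every entry above the diagonal is zero, so frame `t` never sees later frames. In `w_full`, each row spreads its weight over the whole utterance, which is the property the bidirectional full-attention structure relies on.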