2D-to-2D Mask Estimation for Speech Enhancement Based on Fully Convolutional Neural Network

Yan-Hui Tu, Jun Du, Chin-Hui Lee
DOI: https://doi.org/10.1109/icassp40776.2020.9054615
2020-01-01
Abstract: In recent years, deep learning-based approaches have become popular in the field of single-channel speech enhancement. Convolutional neural networks (CNNs) are a standard component of many current speech enhancement systems. In this study, we design a new fully CNN (FCNN)-based regression model, denoted 2D-RFCNN, which directly maps the 2-dimensional (2D) noisy log-power spectra (LPS) input to a 2D time-frequency mask output. First, the whole 2D noisy LPS of one utterance is used directly as the network input so that each convolutional filter can see more contextual information. Second, we apply the pooling operation only along the frequency axis, so that the final frequency dimension has a value of 1 while the number of feature maps equals the number of frequency bins. Finally, we also use deep convolutional layers with small filter sizes, which are popular in speech recognition, for speech enhancement. Experiments on the CHiME-4 challenge task show that our proposed 2D-RFCNN model not only improves speech quality (PESQ) and intelligibility (STOI), but also reduces the recognition error rate on the real test set.
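The shape bookkeeping behind the frequency-only pooling idea can be sketched in plain Python. This is an illustrative trace under stated assumptions, not the authors' architecture: the helper names `trace_shapes` and `mask_shape`, the pooling factor of 2, the starting width of 16 channels, and the example sizes (300 frames, 257 bins) are all hypothetical and do not come from the abstract. It only shows how pooling along frequency alone shrinks that axis to 1 while the time axis is preserved, so that a final layer with as many feature maps as frequency bins yields a mask with the same 2D shape as the input.

```python
def trace_shapes(num_frames, num_bins, pool=2):
    """Trace (channels, time, freq) shapes through a frequency-only pooling stack.

    All layer widths and the pooling factor are assumptions for illustration.
    """
    shapes = [(1, num_frames, num_bins)]  # the 2D LPS as a one-channel input
    channels, freq = 16, num_bins         # hypothetical starting width
    while freq > 1:
        # A small same-padded convolution keeps the time and freq sizes.
        shapes.append((channels, num_frames, freq))
        # Pooling acts on the frequency axis only; time is left untouched.
        freq = max(1, freq // pool)
        shapes.append((channels, num_frames, freq))
        channels = min(channels * 2, num_bins)
    # Final layer: number of feature maps equals the number of frequency bins.
    shapes.append((num_bins, num_frames, 1))
    return shapes


def mask_shape(final_shape):
    """Reinterpret (bins, frames, 1) feature maps as a (frames, bins) mask."""
    channels, frames, freq = final_shape
    if freq != 1:
        raise ValueError("frequency axis must be pooled down to 1")
    return (frames, channels)


if __name__ == "__main__":
    for shape in trace_shapes(300, 257):
        print(shape)
```

Because no pooling touches the time axis, the same stack applies to utterances of any length, which is what allows a whole utterance to be used as input.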