Abstract: In this study, a novel multi-objective speech enhancement algorithm is proposed. First, we construct a deep learning architecture based on a stacked and temporal convolutional neural network (STCNN). Second, the main log-power spectra (LPS) features are fed into a stacked convolutional neural network (SCNN) to extract high-level abstract features. Third, we propose an improved power function compression Mel-frequency cepstral coefficient (PC-MFCC) feature, which is more consistent with human hearing characteristics than the standard Mel-frequency cepstral coefficient (MFCC). Then, a temporal convolutional neural network (TCNN) takes the PC-MFCC and the features learned by the SCNN as input and separately predicts the clean LPS, PC-MFCC and ideal ratio mask (IRM). In the training phase, PC-MFCC constrains the LPS and IRM through the loss function to obtain the optimal network structure. Finally, IRM-based post-processing is applied to the estimated clean LPS and IRM: it adjusts the weight between the two according to voice presence information to synthesise the enhanced speech. A series of experiments shows that PC-MFCC is effective and complementary to LPS in speech enhancement tasks. Owing to its feature extraction and sequence modelling capabilities, the proposed STCNN architecture achieves higher speech enhancement performance than the neural network models it is compared against. Additionally, IRM-based post-processing further improves the listening quality of the reconstructed speech. Compared with the reference algorithm, the proposed multi-objective speech enhancement algorithm further improves the quality and intelligibility of the enhanced speech.
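To make the roles of the IRM and the post-processing step more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a commonly used IRM definition, the square root of clean power over clean-plus-noise power in each time-frequency bin, and a hypothetical mask-weighted blend of two clean-speech estimates in the log-power domain. The abstract does not give the exact weighting rule, so the function names `ideal_ratio_mask` and `blend_lps`, the `eps` constant and the blending formula are all assumptions made for illustration.

```python
import math


def ideal_ratio_mask(clean_power, noise_power, eps=1e-12):
    """IRM for one time-frequency bin under a common definition.

    Assumption: IRM = sqrt(S / (S + N)), with S and N the clean-speech and
    noise power in the bin. eps only guards against division by zero.
    """
    return math.sqrt(clean_power / (clean_power + noise_power + eps))


def blend_lps(noisy_lps, est_clean_lps, est_irm, eps=1e-12):
    """Hypothetical post-processing for one bin (not taken from the paper).

    Two estimates of the clean log-power spectrum are combined:
      * the noisy LPS with the estimated mask applied, which in the log
        domain is log|Y|^2 + 2*log(IRM);
      * the LPS predicted directly by the network.
    The estimated IRM serves as a soft voice-presence weight between them.
    """
    masked_lps = noisy_lps + 2.0 * math.log(est_irm + eps)
    return est_irm * masked_lps + (1.0 - est_irm) * est_clean_lps


if __name__ == "__main__":
    # Equal speech and noise power gives a mask of about 0.707.
    print(ideal_ratio_mask(1.0, 1.0))
    # A mask near 1 keeps the masked noisy spectrum; a mask near 0 falls
    # back on the directly estimated LPS.
    print(blend_lps(0.0, -5.0, 1.0), blend_lps(0.0, -5.0, 0.0))
```

In this sketch the mask plays a double role, as an enhancement target and as a soft indicator of speech presence, which is one plausible reading of "adjusts the weight between the LPS and IRM based on voice presence information"; the paper's actual rule may differ.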

A Speech Enhancement Algorithm By Iterating Single- And Multi-Microphone Processing And Its Application To Robust ASR

Microphone array processing via joint wideband angle-of-arrival estimation and speech feature enhancement

Multi-resolution Auditory Cepstral Coefficient and Adaptive Mask for Speech Enhancement with Deep Neural Network

A Multi-Objective Learning Speech Enhancement Algorithm Based on IRM Post-Processing with Joint Estimation of SCNN and TCNN

End-to-End Dereverberation, Beamforming, and Speech Recognition in a Cocktail Party

A Dual Microphone Speech Enhancement Method With A Smoothing Parameter Mask

Time-Domain Speech Enhancement for Robust Automatic Speech Recognition

Speech Enhancement Algorithm Based on Microphone Array and Lightweight CRN for Hearing Aid

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement

Robust Mask Estimation by Integrating Neural Network-Based and Clustering-Based Approaches for Adaptive Acoustic Beamforming

A Real-Time Dual-Microphone Speech Enhancement Algorithm Assisted by Bone Conduction Sensor

Deep Learning Based Real-Time Speech Enhancement for Dual-Microphone Mobile Phones

A Robust Speech Enhancement Method Based on Microphone Array

Neural Directed Speech Enhancement with Dual Microphone Array in High Noise Scenario

Multi-Channel Automatic Speech Recognition Using Deep Complex Unet

Masking based Spectral Feature Enhancement for Robust Automatic Speech Recognition

Reference Channel Selection by Multi-Channel Masking for End-to-End Multi-Channel Speech Enhancement

Improving End-to-End Single-Channel Multi-Talker Speech Recognition

End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend

Multi-channel Speech Enhancement Based on the MVDR Beamformer and Postfilter

Progressive Multi-Target Network Based Speech Enhancement with SNR-Preselection for Robust Speaker Diarization