Abstract: Traditionally, speech quality evaluation relies on subjective assessments or intrusive methods that require reference signals or additional equipment. In recent years, however, non-intrusive speech quality assessment has emerged as a promising alternative, attracting considerable attention from researchers and industry professionals. This article presents a deep learning-based method that exploits large-scale simulated data with intrusive labels to improve the accuracy and generalization of non-intrusive methods. The major contributions of this article are as follows. First, it presents a data simulation method, which generates degraded speech signals and labels their speech quality with the perceptual objective listening quality assessment (POLQA). The generated data is shown to be useful for pretraining the deep learning models. Second, it proposes applying an adversarial speaker classifier to reduce the impact of speaker-dependent information on speech quality evaluation. Third, it proposes an autoencoder-based deep learning scheme, following the principles of representation learning and adversarial training (AT), that transfers the knowledge learned from a large amount of simulated speech data labeled by POLQA. With the help of discriminative representations extracted from the autoencoder, the prediction model can be trained well on a relatively small amount of speech data labeled through subjective listening tests. Fourth, an end-to-end speech quality evaluation neural network is developed, which takes magnitude and phase spectral features as its inputs. This phase-aware model is more accurate than the model using only the magnitude spectral features. Extensive experiments are carried out with three datasets: one simulated, with labels obtained using POLQA, and two recorded, with labels obtained using subjective listening tests.
The results show that the presented phase-aware method improves on the baseline model, and that the proposed model with latent representations extracted from the adversarial autoencoder (AAE) outperforms the state-of-the-art objective quality assessment methods, reducing the root mean square error (RMSE) by 10.5% and 12.2% on the Beijing Institute of Technology (BIT) dataset and the Tencent Corpus, respectively. The code and supplementary materials are available at https://github.com/liushenme/AAE-SQA.
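
As a rough illustration of the "magnitude and phase spectral features" mentioned in the abstract, the sketch below computes a short-time Fourier transform of a real signal and stacks the magnitude with the phase for each frame. It is not the authors' implementation. The function names (stft_mag_phase, phase_aware_features), the tiny frame length and hop, the Hann window, and the cos/sin encoding of the phase are all assumptions chosen for readability; the abstract does not say how the phase is represented.

```python
import cmath
import math


def stft_mag_phase(signal, frame_len=8, hop=4):
    """Return per-frame one-sided magnitude and phase spectra of a real signal.

    frame_len and hop are illustrative values, not taken from the paper.
    """
    # Periodic Hann window to reduce spectral leakage (an assumption).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # one-sided spectrum for real input
    mags, phases = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        # Direct DFT of the windowed frame (fine for a small sketch).
        spectrum = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(n_bins)
        ]
        mags.append([abs(c) for c in spectrum])
        phases.append([cmath.phase(c) for c in spectrum])  # in [-pi, pi]
    return mags, phases


def phase_aware_features(signal, frame_len=8, hop=4):
    """Stack magnitude with cos/sin of the phase for each frame.

    The cos/sin encoding avoids the wrap-around at +/- pi; it is a
    hypothetical choice here, not something stated in the abstract.
    """
    mags, phases = stft_mag_phase(signal, frame_len, hop)
    return [
        m + [math.cos(x) for x in p] + [math.sin(x) for x in p]
        for m, p in zip(mags, phases)
    ]


# Toy input: a cosine aligned with DFT bin 2 of an 8-sample frame.
toy_signal = [math.cos(2 * math.pi * 2 * n / 8) for n in range(16)]
features = phase_aware_features(toy_signal)
print(len(features), len(features[0]))
```

A magnitude-only baseline in this sketch would keep just the first frame_len // 2 + 1 values of each feature vector; the phase-aware variant appends the remaining entries, which is the kind of extra input the abstract reports as improving accuracy.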

Phase Spectrum Recovery for Enhancing Low-Quality Speech Captured by Laser Microphones

Improve Speech Enhancement Using Perception-High-Related Time-Frequency Loss

Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

Speech Acquisition and Recovery Based on Piezoelectric Effect in the mmWave Band

High-fidelity Acoustic Signal Enhancement for Phase-OTDR Using Supervised Learning

Speech Enhancement Based On Analysis Synthesis Framework With Improved Pitch Estimation And Spectral Envelope Enhancement

A Speech Enhancement Algorithm for Speech Reconstruction Based on Laser Speckle Images

A Speech Enhancement Method Based on Dual-Path Phase-Aware GAN Networks

Speech Enhancement Based on Array-processing-assisted Distributed Fiber Acoustic Sensing

PhasePerturbation: Speech Data Augmentation via Phase Perturbation for Automatic Speech Recognition

Speech Enhancement Based on Analysis–Synthesis Framework with Improved Parameter Domain Enhancement

MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra

Noise Estimation Using Mean Square Cross Prediction Error for Speech Enhancement

Online Monaural Speech Enhancement Based on Periodicity Analysis and A Priori SNR Estimation

pDenoiser: A Personalized Speech Enhancement Neural Network for Pre-hospital Emergency Medical Services

A Speech Enhancement Algorithm Based on Computational Auditory Scene Analysis

Magnitude-and-phase-aware Speech Enhancement with Parallel Sequence Modeling

A Speech Enhancement Algorithm Using Computational Auditory Scene Analysis with Spectral Subtraction

Non-Intrusive Speech Quality Assessment Based on Deep Neural Networks for Speech Communication

LPCSE: Neural Speech Enhancement through Linear Predictive Coding

Speech enhancement from fused features based on deep neural network and gated recurrent unit network