Abstract: Traditionally, speech quality evaluation has relied on subjective assessments or on intrusive methods that require reference signals or additional equipment. In recent years, non-intrusive speech quality assessment has emerged as a promising alternative and has attracted considerable attention from researchers and industry professionals. This article presents a deep learning-based method that exploits large-scale simulated data labeled by an intrusive method to improve the accuracy and generalization of non-intrusive methods. The major contributions of this article are as follows. First, it presents a data simulation method that generates degraded speech signals and labels their speech quality with the perceptual objective listening quality assessment (POLQA). The generated data is shown to be useful for pretraining deep learning models. Second, it proposes an adversarial speaker classifier to reduce the impact of speaker-dependent information on speech quality evaluation. Third, it proposes an autoencoder-based deep learning scheme that follows the principles of representation learning and adversarial training (AT) and transfers the knowledge learned from a large amount of simulated speech data labeled by POLQA. With the discriminative representations extracted from the autoencoder, the prediction model can be trained well on a relatively small amount of speech data labeled through subjective listening tests. Fourth, it develops an end-to-end speech quality evaluation neural network that takes magnitude and phase spectral features as its inputs. This phase-aware model is more accurate than the model that uses only magnitude spectral features. Extensive experiments are carried out on three datasets: one simulated dataset with labels obtained using POLQA, and two recorded datasets with labels obtained from subjective listening tests. The results show that the presented phase-aware method improves the performance of the baseline model, and that the proposed model with latent representations extracted from the adversarial autoencoder (AAE) outperforms state-of-the-art objective quality assessment methods, reducing the root mean square error (RMSE) by 10.5% and 12.2% on the Beijing Institute of Technology (BIT) dataset and the Tencent Corpus, respectively. The code and supplementary materials are available at https://github.com/liushenme/AAE-SQA.
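To illustrate what "magnitude and phase spectral features" can look like in practice, the following is a minimal sketch of a short-time Fourier transform that returns both. It is not the authors' feature pipeline: the function name `stft_mag_phase`, the Hann window, and the frame length and hop size are all assumptions chosen for illustration.

```python
import numpy as np

def stft_mag_phase(signal, frame_len=512, hop=256):
    """Split a 1-D signal into windowed frames and return (magnitude, phase).

    Both outputs have shape (n_frames, frame_len // 2 + 1).
    The window type and frame parameters are illustrative assumptions,
    not values taken from the article.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.ndim != 1 or len(signal) < frame_len:
        raise ValueError("signal must be 1-D and at least frame_len samples long")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Stack overlapping frames and apply the analysis window to each one.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)  # one-sided complex spectrum
    return np.abs(spectrum), np.angle(spectrum)

# Example: one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)
magnitude, phase = stft_mag_phase(tone)
```

A magnitude-only model would discard the second array; a phase-aware model, as described in the abstract, would receive both as input. How the two are combined inside the network is not specified here.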

Extracting Spectral Features Using Deep Autoencoders with Binary Distributed Hidden Units for Statistical Parametric Speech Synthesis

DBN-based Spectral Feature Representation for Statistical Parametric Speech Synthesis

GTDNN-Based Voice Conversion Using DAEs with Binary Distributed Hidden Units

Deep Belief Network-Based Post-Filtering For Statistical Parametric Speech Synthesis

Extracting Structural Spectral Features Using What-Where Auto-Encoders for Statistical Parametric Speech Synthesis

Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis

Unseen Noise Estimation Using Separable Deep Auto Encoder for Speech Enhancement

Modeling Spectral Envelopes Using Deep Conditional Restricted Boltzmann Machines for Statistical Parametric Speech Synthesis

Spectral Modeling Using Neural Autoregressive Distribution Estimators for Statistical Parametric Speech Synthesis

A Deep Generative Architecture for Postfiltering in Statistical Parametric Speech Synthesis

Statistical Parametric Speech Synthesis Using Generalized Distillation Framework

A Novel Method of Artificial Bandwidth Extension Using Deep Architecture

DSVAE: Interpretable Disentangled Representation for Synthetic Speech Detection

Spectral Conversion Using Deep Neural Networks Trained with Multi-Source Speakers

Simultaneous Denoising and Dereverberation Using Deep Embedding Features

An Experimental Study on Speech Enhancement Based on Deep Neural Networks

Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis

A Binaural Deep Neural Networks Parameter Mask for the Robust Automatic Speech Recognition System

Non-Intrusive Speech Quality Assessment Based on Deep Neural Networks for Speech Communication

A regression approach to speech enhancement based on deep neural networks

Binaural Deep Neural Network for Robust Speech Enhancement