Abstract: Traditionally, speech quality evaluation has relied on subjective assessments or on intrusive methods that require reference signals or additional equipment. In recent years, non-intrusive speech quality assessment has emerged as a promising alternative and has attracted considerable attention from researchers and industry professionals. This article presents a deep learning-based method that exploits large-scale simulated data labeled by an intrusive method to improve the accuracy and generalization of non-intrusive methods. The major contributions of this article are as follows. First, it presents a data simulation method that generates degraded speech signals and labels their speech quality with the perceptual objective listening quality assessment (POLQA). The generated data is shown to be useful for pretraining the deep learning models. Second, it proposes applying an adversarial speaker classifier to reduce the impact of speaker-dependent information on speech quality evaluation. Third, it proposes an autoencoder-based deep learning scheme, following the principles of representation learning and adversarial training (AT), that transfers the knowledge learned from a large amount of simulated speech data labeled by POLQA. With the discriminative representations extracted from the autoencoder, the prediction model can be trained well on a relatively small amount of speech data labeled through subjective listening tests. Fourth, it develops an end-to-end speech quality evaluation neural network that takes magnitude and phase spectral features as its inputs. This phase-aware model is more accurate than the model that uses only the magnitude spectral features. Extensive experiments are carried out on three datasets: one simulated, with labels obtained using POLQA, and two recorded, with labels obtained using subjective listening tests. The results show that the phase-aware method improves the performance of the baseline model, and that the proposed model with latent representations extracted from the adversarial autoencoder (AAE) outperforms the state-of-the-art objective quality assessment methods, reducing the root mean square error (RMSE) by 10.5% and 12.2% on the Beijing Institute of Technology (BIT) dataset and the Tencent Corpus, respectively. The code and supplementary materials are available at https://github.com/liushenme/AAE-SQA.
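To make the idea of "magnitude and phase spectral features" concrete, the following is a minimal sketch of how such features could be computed for one utterance. It is not the authors' implementation: the function names (`hann`, `frame_spectrum`, `phase_aware_features`), the frame length, the hop size, the Hann window, and the simple concatenation of magnitude and phase are all assumptions chosen for illustration, and a naive DFT stands in for the FFT that a real system would use.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n (an assumed choice of window)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def frame_spectrum(frame):
    """Return (magnitude, phase) lists for the non-negative frequency bins."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hann(n))]
    magnitude, phase = [], []
    for k in range(n // 2 + 1):
        # Naive DFT of bin k; an FFT would be used in practice.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(windowed)
        )
        magnitude.append(abs(coeff))
        phase.append(cmath.phase(coeff))  # radians in [-pi, pi]
    return magnitude, phase


def phase_aware_features(signal, frame_len=64, hop=32):
    """Concatenate magnitude and phase per frame (hypothetical feature layout)."""
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        mag, ph = frame_spectrum(signal[start:start + frame_len])
        features.append(mag + ph)
    return features
```

A magnitude-only model would keep just the first half of each feature vector; the phase-aware variant described in the abstract also feeds phase information to the network, in whatever encoding the authors chose.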
