Abstract: Traditionally, speech quality evaluation has relied on subjective assessments or on intrusive methods that require reference signals or additional equipment. In recent years, non-intrusive speech quality assessment has emerged as a promising alternative and has attracted considerable attention from researchers and industry professionals. This article presents a deep learning-based method that exploits large-scale simulated data labeled by an intrusive method to improve the accuracy and generalization of non-intrusive methods. The major contributions of this article are as follows. First, it presents a data simulation method that generates degraded speech signals and labels their speech quality with the perceptual objective listening quality assessment (POLQA). The generated data is shown to be useful for pretraining deep learning models. Second, it proposes applying an adversarial speaker classifier to reduce the impact of speaker-dependent information on speech quality evaluation. Third, it proposes an autoencoder-based deep learning scheme, following the principles of representation learning and adversarial training (AT), that transfers the knowledge learned from a large amount of simulated speech data labeled by POLQA. With the discriminative representations extracted from the autoencoder, the prediction model can be trained well on a relatively small amount of speech data labeled through subjective listening tests. Fourth, it develops an end-to-end speech quality evaluation neural network that takes magnitude and phase spectral features as its inputs. This phase-aware model is more accurate than a model using only the magnitude spectral features. Extensive experiments are carried out on three datasets: one simulated, with labels obtained using POLQA, and two recorded, with labels obtained using subjective listening tests. The results show that the presented phase-aware method improves on the baseline model, and that the proposed model with latent representations extracted from the adversarial autoencoder (AAE) outperforms state-of-the-art objective quality assessment methods, reducing the root mean square error (RMSE) by 10.5% and 12.2% on the Beijing Institute of Technology (BIT) dataset and the Tencent Corpus, respectively. The code and supplementary materials are available at https://github.com/liushenme/AAE-SQA.
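As a rough illustration of two notions used in the abstract, the sketch below computes magnitude and phase spectral features from a waveform and the RMSE between predicted and reference quality scores. It is not the authors' implementation: the function names, the frame length of 512 samples, the hop of 256 samples, and the Hann window are all assumptions chosen for the example, and the score values in the usage note are made up.

```python
import numpy as np


def stft_mag_phase(signal, frame_len=512, hop=256):
    """Return (magnitude, phase) spectrograms of a 1-D signal.

    frame_len and hop are illustrative values, not taken from the paper.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    spec = np.fft.rfft(frames, axis=1)  # one-sided complex spectrum per frame
    return np.abs(spec), np.angle(spec)


def rmse(pred, target):
    """Root mean square error between predicted and reference scores."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))


def relative_reduction(baseline, proposed):
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (baseline - proposed) / baseline
```

For a one-second signal sampled at 16 kHz, `stft_mag_phase` returns two arrays of shape `(61, 257)`; a phase-aware model would consume both, while a magnitude-only model would discard the second. With hypothetical scores, `relative_reduction(rmse(base_pred, mos), rmse(new_pred, mos))` expresses an improvement in the same percentage form as the figures reported above.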
