Abstract: The research community has long studied computer-assisted pronunciation training (CAPT) methods for non-native speech. Researchers have studied various model architectures, such as Bayesian networks and deep learning methods, and have analyzed different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods cannot detect pronunciation errors with high accuracy (only 60% precision at 40%-80% recall). One of the key problems is the low availability of mispronounced speech, which is needed to reliably train pronunciation error detection models. If we had a generative model that could mimic non-native speech and produce any amount of training data, the task of detecting pronunciation errors would be much easier. We present three innovative techniques based on phoneme-to-phoneme (P2P), text-to-speech (T2S), and speech-to-speech (S2S) conversion to generate correctly pronounced and mispronounced synthetic speech. We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors but also help establish a new state of the art in the field. Earlier studies have used simple speech generation techniques such as P2P conversion, but only as an additional mechanism to improve the accuracy of pronunciation error detection. We, on the other hand, consider speech generation to be a first-class method of detecting pronunciation errors. The effectiveness of these techniques is assessed on the tasks of detecting pronunciation errors and lexical stress errors. Non-native English speech corpora of German, Italian, and Polish speakers are used in the evaluations. The best proposed S2S technique improves the accuracy of detecting pronunciation errors, measured by AUC, by 41% (from 0.528 to 0.749) compared to the state-of-the-art approach.
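To make the P2P idea concrete, the following is a minimal sketch of how a canonical phoneme sequence might be perturbed to produce labeled "mispronounced" training examples. It is not the authors' implementation: the confusion table `CONFUSIONS`, the function `perturb_phonemes`, and the `error_rate` parameter are all hypothetical names introduced here for illustration, and the phoneme substitutions are made-up examples rather than data from the paper.

```python
import random

# Hypothetical confusion table (illustrative only, not taken from the paper):
# maps a canonical phoneme to substitutes a learner might plausibly produce.
CONFUSIONS = {
    "TH": ["S", "T"],
    "W": ["V"],
    "IH": ["IY"],
}


def perturb_phonemes(phonemes, error_rate, rng):
    """Randomly substitute confusable phonemes.

    Returns the perturbed sequence and a parallel list of labels,
    where 1 marks a substituted (mispronounced) phoneme and 0 marks
    a phoneme left unchanged.
    """
    perturbed, labels = [], []
    for phoneme in phonemes:
        if phoneme in CONFUSIONS and rng.random() < error_rate:
            perturbed.append(rng.choice(CONFUSIONS[phoneme]))
            labels.append(1)
        else:
            perturbed.append(phoneme)
            labels.append(0)
    return perturbed, labels


# Canonical pronunciation of "think" in an ARPAbet-like notation.
canonical = ["TH", "IH", "NG", "K"]
rng = random.Random(0)  # seeded so that runs are repeatable
perturbed, labels = perturb_phonemes(canonical, error_rate=0.5, rng=rng)
```

Each call yields one synthetic training pair: the perturbed sequence stands in for what the learner said, and the labels mark which positions a detector should flag. In this sketch the labels come for free from the generation step, which is the property that makes synthetic data attractive when annotated mispronounced speech is scarce.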

Synthesizing Expressive Speech to Convey Focus using a Perturbation Model for Computer-Aided Pronunciation Training

Synthesizing English Emphatic Speech for Multimodal Corrective Feedback in Computer-Aided Pronunciation Training

Generating Emphasis from Neutral Speech Using Hierarchical Perturbation Model by Decision Tree and Support Vector Machine

HMM-based Emphatic Speech Synthesis for Corrective Feedback in Computer-Aided Pronunciation Training

Modeling Prosody Patterns for Chinese Expressive Text-to-speech Synthesis

Modeling the Expressivity of Input Text Semantics for Chinese Text-to-Speech Synthesis in a Spoken Dialog System

Audiovisual Synthesis of Exaggerated Speech for Corrective Feedback in Computer-Assisted Pronunciation Training

Focusing on attention: prosody transfer and adaptative optimization strategy for multi-speaker end-to-end speech synthesis

EMPHASIS: An Emotional Phoneme-based Acoustic Model for Speech Synthesis System

Controllable Emphatic Speech Synthesis Based on Forward Attention for Expressive Speech Synthesis

Visual-speech Synthesis of Exaggerated Corrective Feedback

Effective Acoustic Modeling for Pronunciation Quality Scoring of Strongly Accented Mandarin Speech

Generating emphatic speech with hidden Markov model for expressive speech synthesis

Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data

Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need

Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

Modelling the Global Acoustic Correlates of Expressivity for Chinese Text-to-Speech Synthesis

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Focus, lexical stress and boundary tone: interaction of three prosodic features

A Superposed Prosodic Model for Chinese Text-To-Speech Synthesis

Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems