Abstract: The research community has long studied computer-assisted pronunciation training (CAPT) methods for non-native speech. Researchers have focused on various model architectures, such as Bayesian networks and deep learning methods, and on different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods cannot detect pronunciation errors with high accuracy (only 60% precision at 40%-80% recall). One of the key problems is the low availability of mispronounced speech, which is needed to train pronunciation error detection models reliably. If we had a generative model that could mimic non-native speech and produce any amount of training data, the task of detecting pronunciation errors would be much easier. We present three innovative techniques based on phoneme-to-phoneme (P2P), text-to-speech (T2S), and speech-to-speech (S2S) conversion to generate correctly pronounced and mispronounced synthetic speech. We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors but also help establish a new state of the art in the field. Earlier studies have used simple speech generation techniques such as P2P conversion, but only as an additional mechanism to improve the accuracy of pronunciation error detection. We, on the other hand, treat speech generation as a first-class method of detecting pronunciation errors. The effectiveness of these techniques is assessed on the tasks of detecting pronunciation errors and lexical stress errors, using non-native English speech corpora of German, Italian, and Polish speakers. The best proposed S2S technique improves the accuracy of detecting pronunciation errors, measured by AUC, by 41% (from 0.528 to 0.749) compared to the state-of-the-art approach.
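
The idea behind P2P conversion, generating labeled mispronunciations by editing a canonical phoneme sequence, can be illustrated with a minimal sketch. This is not the authors' implementation: the confusion table, the `perturb_phonemes` helper, and the substitution-only error model are assumptions made for illustration, and the abstract does not specify how the phoneme perturbations are chosen.

```python
import random

# Hypothetical confusion table: each canonical phoneme maps to substitutions
# a non-native speaker might plausibly produce. The pairs are illustrative
# assumptions, not taken from the paper.
CONFUSIONS = {
    "TH": ["S", "T"],
    "W": ["V"],
    "IH": ["IY"],
    "AE": ["EH"],
}


def perturb_phonemes(phonemes, error_rate, rng):
    """Return (perturbed sequence, per-phoneme labels), where label 1 marks
    a substituted (mispronounced) phoneme and 0 marks an unchanged one."""
    perturbed, labels = [], []
    for phoneme in phonemes:
        # Only phonemes listed in the table can be substituted.
        if phoneme in CONFUSIONS and rng.random() < error_rate:
            perturbed.append(rng.choice(CONFUSIONS[phoneme]))
            labels.append(1)
        else:
            perturbed.append(phoneme)
            labels.append(0)
    return perturbed, labels


# Canonical phonemes for the word "thing" (ARPAbet-style symbols).
canonical = ["TH", "IH", "NG", "K"]
rng = random.Random(0)  # seeded so repeated runs give the same sample
mispronounced, labels = perturb_phonemes(canonical, 0.5, rng)
print(mispronounced, labels)
```

Each call yields a training pair: a phoneme sequence and the positions where it deviates from the canonical pronunciation. A detection model could in principle be trained on many such pairs, with `error_rate` controlling how often errors appear.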

Analysis on Mispronunciations in CAPT Based on Computational Speech Perception

Grading the Severity of Mispronunciations in CAPT Based on Statistical Analysis and Computational Speech Perception

BiCAPT: Bidirectional Computer-Assisted Pronunciation Training with Normalizing Flows

An Application of Modified Confusion Network for Improving Mispronunciation Detection in Computer-aided Mandarin Pronunciation Training

Text-Aware End-to-end Mispronunciation Detection and Diagnosis

Masked Acoustic Unit for Mispronunciation Detection and Correction

Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method

A review of tools and techniques for computer aided pronunciation training (CAPT) in English

PTeacher: a Computer-Aided Personalized Pronunciation Training System with Exaggerated Audio-Visual Corrective Feedback

Towards Robust Mispronunciation Detection and Diagnosis for L2 English Learners with Accent-Modulating Methods

Are Scoring Feedback of CAPT Systems Helpful for Pronunciation Correction? –An Exception of Mandarin Nasal Finals

Improved Mispronunciation detection system using a hybrid CTC-ATT based approach for L2 English speakers

End-to-end Mispronunciation Detection with Simulated Error Distance

A two-stage mispronunciation detection approach for computer-assisted pronunciation training

Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need

Automated detection of pronunciation errors in non-native English speech employing deep learning

Perceptual Evaluation of Pronunciation Quality for Computer Assisted Language Learning

Applying Multitask Learning To Acoustic-Phonemic Model For Mispronunciation Detection And Diagnosis In L2 English Speech

Evaluating a computer-assisted pronunciation training (CAPT) technique for efficient classroom instruction

Adaptive Frequency Cepstral Coefficients for Word Mispronunciation Detection

Visual-speech Synthesis of Exaggerated Corrective Feedback