Abstract: This report proposes research that builds on the state of the art in Computer Assisted Language Learning (CALL). Mispronunciation detection is a core component of Computer Assisted Pronunciation Training (CAPT) systems, which are a subset of CALL. Studies on automated pronunciation error detection began in the 1990s, but the development of full-fledged CAPT systems has accelerated only in the last decade, owing to increased computing power and the availability of mobile devices for recording the speech required for pronunciation analysis. Detecting pronunciation errors is a hard problem because there is no formal definition of correct and incorrect pronunciation. As a result, systems typically detect prosodic errors and phoneme errors such as substitution, insertion, and deletion. It is also widely agreed that pronunciation learning should focus on speaker intelligibility rather than on sounding like an L1 English speaker. Early methods were based on a posterior-likelihood measure called Goodness of Pronunciation (GOP), computed with Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) and Deep Neural Network-Hidden Markov Model (DNN-HMM) approaches. These systems are complex to implement compared with the recently proposed ASR-based End-to-End (E2E) mispronunciation detection systems, which have shown considerable improvement in mispronunciation detection accuracy. The purpose of this research is to create E2E models using Connectionist Temporal Classification (CTC) and an attention-based sequence decoder. The research will compare three baseline systems: CNN-RNN-CTC, CNN-RNN-CTC with a character sequence-based attention decoder, and CNN-RNN-CTC with a phoneme-based decoder. The comparison will help in deciding on a better approach to developing an efficient mispronunciation detection system.
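To make the three phoneme error types named in the abstract concrete, the following is a minimal sketch, not the method of this report. It assumes the phoneme sequence produced by a recognizer is already available and labels errors by aligning it against the canonical sequence with a standard edit-distance backtrace. The function name `align_phonemes` and the ARPAbet-style phone labels in the example are hypothetical choices made for illustration.

```python
from typing import List, Tuple


def align_phonemes(canonical: List[str],
                   recognized: List[str]) -> List[Tuple[str, str, str]]:
    """Align two phoneme sequences and label each position.

    Returns (label, canonical_phone, recognized_phone) tuples, where label is
    "correct", "substitution", "deletion" (canonical phone not spoken) or
    "insertion" (extra phone spoken). "-" marks the missing side.
    """
    n, m = len(canonical), len(recognized)
    # cost[i][j] = edit distance between canonical[:i] and recognized[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(canonical[i - 1] != recognized[j - 1])
            cost[i][j] = min(cost[i - 1][j - 1] + mismatch,  # match / substitute
                             cost[i - 1][j] + 1,             # delete
                             cost[i][j - 1] + 1)             # insert

    # Walk back from the end to recover one minimum-cost alignment.
    ops: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = int(canonical[i - 1] != recognized[j - 1])
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                label = "substitution" if mismatch else "correct"
                ops.append((label, canonical[i - 1], recognized[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("deletion", canonical[i - 1], "-"))
            i -= 1
        else:
            ops.append(("insertion", "-", recognized[j - 1]))
            j -= 1
    ops.reverse()
    return ops


# Hypothetical example: canonical "the cat" vs. a learner's recognized phones.
example = align_phonemes(["DH", "AH", "K", "AE", "T"], ["D", "AH", "K", "AE"])
errors = [op for op in example if op[0] != "correct"]
```

Minimum-cost alignments are not always unique, so the backtrace above prefers match or substitution, then deletion, then insertion. A real system would also need the recognizer itself, which this sketch leaves out.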

Improving pronunciation erroneous tendency detection with convolutional long short-term memory

BiCAPT: Bidirectional Computer-Assisted Pronunciation Training with Normalizing Flows

End-to-end Mispronunciation Detection with Simulated Error Distance

Speech neuromuscular decoding based on spectrogram images using conformal predictors with Bi-LSTM.

Grading the Severity of Mispronunciations in CAPT Based on Statistical Analysis and Computational Speech Perception

Improve low-resource non-native mispronunciation detection with native speech by articulatory-based tandem feature

An Application of Modified Confusion Network for Improving Mispronunciation Detection in Computer-aided Mandarin Pronunciation Training

Analysis on Mispronunciations in Capt Based on Computational Speech Perception

Improved Mispronunciation detection system using a hybrid CTC-ATT based approach for L2 English speakers

Text-Aware End-to-end Mispronunciation Detection and Diagnosis

Improve Mispronunciation Detection with Tandem Feature

Improved mispronunciation detection with deep neural network trained acoustic models and transfer learning based logistic regression classifiers.

Context-aware Goodness of Pronunciation for Computer-Assisted Pronunciation Training

An Automatic Pronunciation Quality Assessing Algorithm for Computer Assisted Language Learning

Masked Acoustic Unit for Mispronunciation Detection and Correction

Evaluation Model of College English Multimedia Teaching Effect Based on Deep Convolutional Neural Networks

Are Scoring Feedback of CAPT Systems Helpful for Pronunciation Correction? –An Exception of Mandarin Nasal Finals

PTeacher: a Computer-Aided Personalized Pronunciation Training System with Exaggerated Audio-Visual Corrective Feedback

Mispronunciation Detection with an Optimized Detection Network and Multi-Layer Perception Based Features

Applying Multitask Learning To Acoustic-Phonemic Model For Mispronunciation Detection And Diagnosis In L2 English Speech

Transducers with Pronunciation-aware Embeddings for Automatic Speech Recognition