Abstract: The automatic identification and analysis of pronunciation errors, known as Mispronunciation Detection and Diagnosis (MDD), plays a crucial role in Computer Aided Pronunciation Learning (CAPL) tools such as Second-Language (L2) learning or speech therapy applications. Existing MDD methods that rely on phoneme analysis can detect only categorical errors, and only for phonemes with enough training data to be modelled. Given the unpredictable nature of the pronunciation errors made by non-native or disordered speakers and the scarcity of training datasets, it is infeasible to model all types of mispronunciation. Moreover, phoneme-level MDD approaches have a limited ability to provide detailed diagnostic information about the error made. In this paper, we propose a low-level MDD approach based on the detection of speech attribute features. Speech attribute features break down phoneme production into elementary components that are directly related to the articulatory system, leading to more formative feedback for the learner. We further propose a multi-label variant of the Connectionist Temporal Classification (CTC) approach to jointly model the non-mutually exclusive speech attributes using a single model. The pre-trained wav2vec2 model was employed as the core model for the speech attribute detector. The proposed method was applied to L2 speech corpora collected from learners of English with different native languages. The proposed speech attribute MDD method was further compared with traditional phoneme-level MDD and achieved significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR), and Diagnostic Error Rate (DER) across all speech attributes.
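
The abstract reports FAR, FRR, and DER without defining them. The sketch below assumes the definitions commonly used in MDD evaluation, which this text does not confirm: FAR is the share of mispronunciations accepted as correct, FRR is the share of correct pronunciations flagged as errors, and DER is the share of detected mispronunciations given the wrong diagnosis. The names `MDDCounts` and `mdd_rates` and the example counts are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass
class MDDCounts:
    """Hypothetical container for MDD decision counts (illustrative only)."""
    true_accept: int    # correct pronunciation, accepted as correct
    false_reject: int   # correct pronunciation, flagged as an error
    false_accept: int   # mispronunciation, accepted as correct
    correct_diag: int   # mispronunciation detected and diagnosed correctly
    diag_error: int     # mispronunciation detected but diagnosed incorrectly


def _ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 instead of raising when a category has no samples.
    return numerator / denominator if denominator else 0.0


def mdd_rates(c: MDDCounts) -> dict:
    """Compute FAR, FRR, and DER under the assumed definitions."""
    # Every detected mispronunciation is either diagnosed correctly or not.
    true_reject = c.correct_diag + c.diag_error
    return {
        "FAR": _ratio(c.false_accept, c.false_accept + true_reject),
        "FRR": _ratio(c.false_reject, c.false_reject + c.true_accept),
        "DER": _ratio(c.diag_error, true_reject),
    }


# Made-up counts, not results from the paper.
counts = MDDCounts(true_accept=900, false_reject=100,
                   false_accept=40, correct_diag=120, diag_error=40)
rates = mdd_rates(counts)
print(rates)
```

In an attribute-level setting, the same counts could be collected separately for each speech attribute and the rates reported per attribute, under the same assumed definitions.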

Unsupervised Discovery of an Extended Phoneme Set in L2 English Speech for Mispronunciation Detection and Diagnosis

Unsupervised Discovery of Non-native Phonetic Patterns in L2 English Speech for Mispronunciation Detection and Diagnosis

Deep segmental phonetic posterior-grams based discovery of non-categories in L2 English speech

Applying Multitask Learning To Acoustic-Phonemic Model For Mispronunciation Detection And Diagnosis In L2 English Speech

Integrating Articulatory Features into Acoustic-Phonemic Model for Mispronunciation Detection and Diagnosis in L2 English Speech

Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment

Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method

End-to-end Mispronunciation Detection with Simulated Error Distance

A Full Text-Dependent End to End Mispronunciation Detection and Diagnosis with Easy Data Augmentation Techniques

An Approach to Mispronunciation Detection and Diagnosis with Acoustic, Phonetic and Linguistic (APL) Embeddings

SED-MDD: Towards Sentence Dependent End-To-End Mispronunciation Detection and Diagnosis

Towards Robust Mispronunciation Detection and Diagnosis for L2 English Learners with Accent-Modulating Methods

End-to-End Mispronunciation Detection and Diagnosis Using Transfer Learning

L1-aware Multilingual Mispronunciation Detection Framework

An adaptive unsupervised clustering of pronunciation errors for automatic pronunciation error detection

Text-Aware End-to-end Mispronunciation Detection and Diagnosis

A two-stage mispronunciation detection approach for computer-assisted pronunciation training

A new method for mispronunciation detection using Support Vector Machine based on Pronunciation Space Models

Improvement in Text-Dependent Mispronunciation Detection for English Learners

Automatic lexical stress and pitch accent detection for L2 English speech using multi-distribution deep neural networks

Automatic Pronunciation Error Detection Based on Extended Pronunciation Space Using the Unsupervised Clustering of Pronunciation Errors