Abstract:
Introduction: Improving the generalization and reliability of speech recognition models for air traffic control (ATC) is challenging. ATC speech data are limited in volume, difficult to acquire, and costly to label, which can lead to sample bias and class imbalance and, in turn, to uncertain and inaccurate recognition results. This study investigates an accent-based method for assessing the quality of ATC speech, so that combinations of data quality categories can be selected to match the requirements of different model application scenarios and mitigate these issues.
Methods: The impact of accents on the performance of speech recognition models is analyzed, and a fusion-feature phoneme recognition model based on prior text information is constructed to identify the phonemes uttered by speakers. The model comprises an audio encoding module, a prior text encoding module, a feature fusion module, and fully connected layers. It takes speech and its corresponding prior text as input and outputs a predicted phoneme sequence for the speech. Accented speech is recognized as phonemes that do not match the phoneme sequence transcribed from the actual speech text, so accents in ATC communication are evaluated quantitatively by calculating the differences between the recognized phoneme sequence and the transcribed phoneme sequence. In addition, speech with different accent levels is fed into different types of speech recognition models to analyze and compare their recognition accuracy.
Results: Under the same experimental conditions, the largest effect of accent level on speech recognition accuracy in ATC communication is 26.37%.
Discussion: This further demonstrates that accents affect the accuracy of speech recognition models in ATC communication and can serve as one of the metrics for evaluating the quality of ATC speech.
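
The idea of scoring accent by comparing a recognized phoneme sequence with the phoneme sequence transcribed from the reference text can be illustrated with a minimal sketch. The abstract does not specify how the difference is computed, so the normalized Levenshtein (edit) distance used here is an assumption chosen for illustration, not the authors' metric. The function names `edit_distance` and `accent_score` and the ARPAbet-style phoneme labels are likewise hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the first i-1 reference phonemes
    # and the first j hypothesis phonemes.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference phoneme
                curr[j - 1] + 1,     # insert a hypothesis phoneme
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def accent_score(canonical: Sequence[str], recognized: Sequence[str]) -> float:
    """Assumed accent measure: edit distance normalized by reference length.

    0.0 means the recognized phonemes match the transcribed ones exactly;
    larger values mean a larger deviation.
    """
    if not canonical:
        raise ValueError("canonical phoneme sequence must not be empty")
    return edit_distance(canonical, recognized) / len(canonical)


# Hypothetical example: the word "climb" with one substitution and one deletion.
canonical = ["K", "L", "AY", "M"]
recognized = ["K", "R", "AY"]
print(accent_score(canonical, recognized))  # 0.5
```

Under this assumed measure, utterances could be binned into accent levels by thresholding the score, which is one way to obtain the accent-level groups that are then fed to the speech recognition models.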

Disentangling the Contribution of Non-native Speech in Automated Pronunciation Assessment

DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling

A Computer-Assisted Tool for Automatically Measuring Non-Native Japanese Oral Proficiency

Effective Acoustic Modeling for Pronunciation Quality Scoring of Strongly Accented Mandarin Speech

Native and Non-native Speech Recognition Acoustic Modeling

Improving Non-native Word-level Pronunciation Scoring with Phone-level Mixup Data Augmentation and Multi-source Information

Leveraging phone-level linguistic-acoustic similarity for utterance-level pronunciation scoring

Improving pronunciation assessment via ordinal regression with anchored reference samples

Assessment and analysis of accents in air traffic control speech: a fusion of deep learning and information theory

Context-aware Goodness of Pronunciation for Computer-Assisted Pronunciation Training

Improve low-resource non-native mispronunciation detection with native speech by articulatory-based tandem feature

Improved mispronunciation detection with deep neural network trained acoustic models and transfer learning based logistic regression classifiers

ASR-Free Pronunciation Assessment

Spoken English Assessment System for Non-Native Speakers Using Acoustic and Prosodic Features

End-to-End Word-Level Pronunciation Assessment with MASK Pre-training

Integrating Articulatory Features into Acoustic-Phonemic Model for Mispronunciation Detection and Diagnosis in L2 English Speech

The Relevance of Text and Speech Features in Automatic Non-native English Accent Identification

Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation Assessment

An Automatic Pronunciation Quality Assessing Algorithm for Computer Assisted Language Learning

A Hierarchical Context-aware Modeling Approach for Multi-aspect and Multi-granular Pronunciation Assessment

Combined Acoustic and Pronunciation Modelling for Non-Native Speech Recognition