Abstract:
Introduction: Improving the generalization and reliability of speech recognition models for air traffic control (ATC) is challenging. ATC speech data are limited in volume, difficult to acquire, and costly to label, which can introduce sample bias and class imbalance and, in turn, make recognition results uncertain and inaccurate. This study investigates an accent-based method for assessing the quality of ATC speech, so that different combinations of data quality categories can be selected to match the requirements of different model application scenarios.
Methods: The impact of accents on the performance of speech recognition models is analyzed, and a fusion-feature phoneme recognition model based on prior text information is constructed to identify the phonemes actually uttered by speakers. The model comprises an audio encoding module, a prior text encoding module, a feature fusion module, and fully connected layers. It takes speech and the corresponding prior text as input and outputs a predicted phoneme sequence. Accented speech is recognized as phonemes that do not match the phoneme sequence transcribed from the actual speech text, so the accent in ATC communication is evaluated quantitatively by calculating the difference between the recognized phoneme sequence and the transcribed phoneme sequence. In addition, speech with different accent levels is fed into different types of speech recognition models to analyze and compare their recognition accuracy.
Results: Under the same experimental conditions, the largest effect of accent level on speech recognition accuracy in ATC communication is 26.37%.
Discussion: These results further demonstrate that accents affect the accuracy of speech recognition models in ATC communication, and that accent can serve as one of the metrics for evaluating the quality of ATC speech.
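
The abstract does not specify how the difference between the recognized and transcribed phoneme sequences is calculated. As an illustration only, the sketch below assumes a Levenshtein (edit) distance normalized by the length of the reference sequence; the function names `edit_distance` and `accent_score` and the ARPAbet-style phoneme labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences (assumed metric)."""
    # prev[j] holds the distance between the first i-1 reference phonemes
    # and the first j hypothesis phonemes.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference phoneme missing from hypothesis
                curr[j - 1] + 1,     # extra phoneme in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def accent_score(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Hypothetical accent measure: edit distance per reference phoneme."""
    if not ref:
        raise ValueError("reference phoneme sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Phonemes transcribed from the text "climb" versus phonemes a model
# might recognize from accented speech (illustrative values only).
reference = ["K", "L", "AY", "M"]
recognized = ["K", "R", "AY", "M"]
score = accent_score(reference, recognized)
```

Under this assumption, a score of 0 means the recognized phonemes match the transcription exactly, and larger scores indicate a stronger deviation. Utterances could then be binned into accent levels by thresholding the score, though any such thresholds would need to be chosen empirically.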

Assessment and analysis of accents in air traffic control speech: a fusion of deep learning and information theory
