Abstract:
Introduction: Enhancing the generalization and reliability of speech recognition models in the field of air traffic control (ATC) is challenging. ATC speech data are scarce, difficult to acquire, and costly to label, which can cause sample bias and class imbalance and, in turn, uncertain and inaccurate recognition results. This study investigates a method for assessing the quality of ATC speech based on accents. To address these issues, different combinations of data quality categories are selected according to the requirements of each model application scenario.
Methods: The impact of accents on the performance of speech recognition models is analyzed, and a fusion-feature phoneme recognition model based on prior text information is constructed to identify the phonemes uttered by speakers. The model comprises an audio encoding module, a prior text encoding module, a feature fusion module, and fully connected layers. It takes speech and its corresponding prior text as input and outputs a predicted phoneme sequence for the speech. Accented speech is recognized as phonemes that do not match the phoneme sequence transcribed from the actual speech text, so accents in ATC communication are evaluated quantitatively by calculating the differences between the recognized phoneme sequence and the transcribed phoneme sequence. Additionally, speech at different accent levels is fed into different types of speech recognition models to analyze and compare their recognition accuracy.
Results: Under the same experimental conditions, the impact of different accent levels on speech recognition accuracy in ATC communication reaches as much as 26.37%.
Discussion: This further demonstrates that accents affect the accuracy of speech recognition models in ATC communication and can be considered one of the metrics for evaluating the quality of ATC speech.
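The accent measure described in the Methods, a difference between the recognized phoneme sequence and the phoneme sequence transcribed from the text, can be illustrated with a minimal sketch. The abstract does not state which difference measure is used, so the Levenshtein edit distance normalized by the reference length is an assumption made here for illustration only. The function names and the ARPAbet-style phoneme symbols are likewise hypothetical.

```python
from typing import Sequence


def phoneme_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences.

    Assumption: edit distance stands in for the (unspecified)
    difference measure mentioned in the abstract.
    """
    # prev[j] holds the distance between the ref prefix processed so far
    # and the first j phonemes of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # phoneme in ref missing from hyp
                curr[j - 1] + 1,     # extra phoneme in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def accent_score(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Hypothetical accent score: edit distance per reference phoneme."""
    if not ref:
        raise ValueError("reference phoneme sequence must not be empty")
    return phoneme_edit_distance(ref, hyp) / len(ref)


# Illustrative phoneme sequences (not taken from the paper's data).
transcribed = ["K", "L", "AY", "M"]   # phonemes derived from the text
recognized = ["K", "R", "AY", "M"]    # phonemes predicted from the audio
score = accent_score(transcribed, recognized)
```

Under this assumption, a score of 0 means the recognized phonemes match the transcription exactly, and larger values indicate a stronger deviation. Utterances could then be binned into accent levels by thresholding the score, with thresholds chosen for the application at hand.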

What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model

Accent Recognition with Hybrid Phonetic Features

Advanced accent/dialect identification and accentedness assessment with multi-embedding models and automatic speech recognition

A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models

Probing self-supervised speech models for phonetic and phonemic information: a case study in aspiration

Improving Language Identification of Accented Speech

Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment

Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning

A Quantitative Approach to Understand Self-Supervised Models as Cross-lingual Feature Extractors

Don't Stop Self-Supervision: Accent Adaptation of Speech Representations via Residual Adapters

Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech

Human-like Linguistic Biases in Neural Speech Models: Phonetic Categorization and Phonotactic Constraints in Wav2Vec2.0

Assessment and analysis of accents in air traffic control speech: a fusion of deep learning and information theory

Acoustic-to-articulatory inversion for dysarthric speech: Are pre-trained self-supervised representations favorable?

Leveraging Native Language Speech for Accent Identification using Deep Siamese Networks

Accent conversion using discrete units with parallel data synthesized from controllable accented TTS

Improved Accent Classification Combining Phonetic Vowels with Acoustic Features

Investigating model performance in language identification: beyond simple error statistics

Effective Acoustic Modeling for Pronunciation Quality Scoring of Strongly Accented Mandarin Speech

Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition

Resolving competing predictions in speech: How qualitatively different cues and cue reliability contribute to phoneme identification