Abstract: This work focuses on robust speech recognition in air traffic control (ATC). It proposes a novel processing paradigm that integrates multilingual speech recognition into a single framework using three cascaded modules: an acoustic model (AM), a pronunciation model (PM), and a language model (LM). The AM converts ATC speech into a phoneme-based text sequence, which the PM then translates into a word-based sequence, the ultimate goal of this research. The LM corrects both phoneme- and word-based errors in the decoding results. The AM, which combines a convolutional neural network (CNN) and a recurrent neural network (RNN), considers the spatial and temporal dependencies of the speech features and is trained with the connectionist temporal classification (CTC) loss. To cope with radio transmission noise and diversity among speakers, a multiscale CNN architecture is proposed to fit the diverse data distributions and improve performance. Phoneme-to-word translation is addressed by a proposed machine translation PM with an encoder–decoder architecture. RNN-based LMs are trained to account for the code-switching specific to ATC speech by building dependencies with common words. We validate the proposed approach on large amounts of real Chinese and English ATC recordings and achieve a 3.95% label error rate on Chinese characters and English words, outperforming other popular approaches. The decoding efficiency is comparable to that of the end-to-end model, and the generalizability of the approach is validated on several open corpora, making it suitable for real-time use in further ATC applications such as ATC prediction and safety checking.
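Two ideas in the abstract can be made concrete with a minimal sketch: how a CTC-trained AM's frame-level output is collapsed into a label sequence, and how a label error rate over mixed Chinese characters and English words might be computed. This is not the authors' implementation. The blank symbol name, the tokenization rule (one label per CJK character or per run of Latin letters and digits), and the definition of the error rate as edit distance divided by reference length are assumptions made for illustration.

```python
import re
from itertools import groupby

BLANK = "<blank>"  # assumed name for the CTC blank symbol

def ctc_collapse(frame_labels, blank=BLANK):
    """Greedy CTC rule: merge consecutive repeats, then drop blanks."""
    return [label for label, _ in groupby(frame_labels) if label != blank]

# Assumed labeling unit: one CJK character, or one run of Latin letters/digits.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")

def tokenize(text):
    """Split mixed Chinese/English text into characters and lowercase words."""
    return _TOKEN.findall(text.lower())

def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def label_error_rate(ref_text, hyp_text):
    """Edit distance over reference length, on characters and words."""
    ref, hyp = tokenize(ref_text), tokenize(hyp_text)
    if not ref:
        raise ValueError("reference contains no labels")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `label_error_rate("国航 one two 三", "国航 one too 三")` counts one substitution against five reference labels. In the cascaded design described above, the collapsed phoneme sequence would be passed to the PM rather than scored directly.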

A Unified Framework for Multilingual Speech Recognition in Air Traffic Control Systems