Abstract: As a new deep learning framework, the Transformer has attracted growing attention from researchers and has become a research hotspot. Inspired by the way humans focus only on what is important, the self-attention mechanism in the Transformer model learns to weight the most important information in the input sequence. In speech recognition, the goal is to transcribe the input speech sequence into the corresponding text. Traditional speech recognition systems combine an acoustic model, a pronunciation dictionary, and a language model. The Transformer, by contrast, can integrate these components into a single neural network, forming an end-to-end speech recognition system that avoids the forced alignment and separate multi-module training required by traditional systems. It is therefore worthwhile to examine the problems that arise when applying the Transformer to speech recognition. This paper first introduces the structure of the Transformer model. It then analyzes the challenges that speech recognition poses with respect to the input speech sequence, the deep model architecture, and model inference, and outlines and summarizes methods for addressing each of these three aspects. Finally, it discusses future applications and research directions for the Transformer in speech recognition.
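To make the self-attention idea above concrete, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It assumes no masking, no multi-head split, and toy projection matrices chosen by hand; in a real model `w_q`, `w_k`, and `w_v` are learned. The helper names (`matmul`, `self_attention`, and so on) are hypothetical and are not taken from the surveyed paper. Each output frame is a weighted average of all value vectors, with weights softmax(q_i · k_j / sqrt(d_k)), so the frames that score highest against a given query contribute most to its output.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix, w_q: Matrix, w_k: Matrix,
                   w_v: Matrix) -> Tuple[Matrix, Matrix]:
    """Single-head scaled dot-product self-attention (hypothetical helper).

    x is a (T x d_model) sequence, e.g. T acoustic frames.
    Returns (output, weights): output is (T x d_v), and weights[i][j] is
    how strongly frame i attends to frame j (each row sums to 1).
    """
    q = matmul(x, w_q)           # queries
    k = matmul(x, w_k)           # keys
    v = matmul(x, w_v)           # values
    d_k = len(w_k[0])            # key dimension used for scaling
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy usage: three 2-dimensional "frames" with identity projections.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, attn = self_attention(frames, identity, identity, identity)
for row in attn:
    print([round(w, 3) for w in row])
```

Because the whole sequence is mixed through these weights in one step, every frame can draw on every other frame directly, which is the property that lets a Transformer model long-range context without recurrence.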

End-to-end Oriental Language Speech Recognition with Integrated Language Identification

Oriental Language Recognition (OLR) 2020: Summary and Analysis

OLR 2021 Challenge: Datasets, Rules and Baselines

Oriental Language Recognition (OLR) 2021: Summary and Analysis

A Self-Supervised Model for Language Identification Integrating Phonological Knowledge

AP20-OLR Challenge: Three Tasks and Their Baselines

The XMUSPEECH System for the AP19-OLR Challenge

AP19-OLR Challenge: Three Tasks and Their Baselines

Chinese Dialect Speech Recognition Based on End-to-end Machine Learning

Improving Transformer Based End-to-End Code-Switching Speech Recognition Using Language Identification

Phone-Aware Multi-task Learning and Length Expanding for Short-Duration Language Recognition

Research Status and Prospect of Transformer in Speech Recognition

Towards Language-Universal Mandarin-English Speech Recognition

End-to-End Cross-Lingual Spoken Language Understanding Model with Multilingual Pretraining

RNN-Transducer with Language Bias for End-to-End Mandarin-English Code-Switching Speech Recognition

AP17-OLR Challenge: Data, Plan, and Baseline

AP18-OLR Challenge: Three Tasks and Their Baselines

An Empirical Study of Language Model Integration for Transducer Based Speech Recognition