External Text Based Data Augmentation for Low-Resource Speech Recognition in the Constrained Condition of OpenASR21 Challenge

Guolong Zhong, Hongyu Song, Ruoyu Wang, Lei Sun, Diyuan Liu, Jia Pan, Xin Fang, Jun Du, Jie Zhang, Lirong Dai
DOI: https://doi.org/10.21437/interspeech.2022-649
2022-01-01
Abstract: This paper describes our USTC NELSLIP system submitted to the Open Automatic Speech Recognition (OpenASR21) Challenge for the Constrained condition, where only a 10-hour speech dataset is allowed for training while additional text data is unlimited. To improve the low-resource speech recognition performance, we collect external text data for language modeling and train a text-to-speech (TTS) model to generate speech-text paired data. Our system is then built based on the conventional hybrid structure, where various subsystems are developed using different acoustic neural network architectures and different data augmentation methods. Finally, system fusion is employed to obtain the final result. Experiments on the OpenASR21 challenge show that the proposed system achieves the best performance for all testing languages.