Progressive Multi-scale Self-supervised Learning for Speech Recognition

Genshun Wan, Hang Chen, Tan Liu, Chenxi Wang, Jia Pan, Zhongfu Ye
DOI: https://doi.org/10.1109/apsipaasc58517.2023.10317133
2023-01-01
Abstract: Self-supervised learning has shown great potential in improving automatic speech recognition (ASR) systems. However, recognition performance could improve further if models focused on learning audio content information. In this paper, we propose a progressive multi-scale self-supervised learning method that reinforces the learning process from easy to difficult. Our progressive strategy uses fine-grained target sets to compute the self-supervised learning loss at the top layer and coarse-grained target sets at intermediate layers. Additionally, to match the difficulty of the learning process, we introduce a multi-scale structure into the multi-head self-attention module. We evaluate our method on the LibriSpeech dataset and demonstrate its effectiveness. Compared with HuBERT, our proposed method achieves relative word error rate (WER) reductions of 13.7% and 12.7% on the test-other evaluation subset when fine-tuned on the 10-hour and 100-hour subsets, respectively.
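The progressive strategy described in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss formulation, so everything here is an assumption for illustration: the per-frame cross-entropy over cluster ids (in the style of HuBERT-like masked prediction, with masking omitted), the equal layer weights, the cluster counts, and the hypothetical names `cross_entropy` and `progressive_loss`. It is not the authors' implementation.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one frame: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def progressive_loss(layer_logits, layer_targets, weights=None):
    """Weighted sum of frame-averaged cross-entropy over supervised layers.

    layer_logits:  one entry per supervised layer; each is a list of
                   per-frame logit lists over that layer's target set.
    layer_targets: one entry per supervised layer; each is a list of
                   per-frame cluster ids from that layer's target set.
    weights:       optional per-layer weights (assumed equal by default).
    """
    if weights is None:
        weights = [1.0] * len(layer_logits)
    total = 0.0
    for logits, targets, w in zip(layer_logits, layer_targets, weights):
        frame_losses = [cross_entropy(l, t) for l, t in zip(logits, targets)]
        total += w * sum(frame_losses) / len(frame_losses)
    return total

# Toy example (assumed sizes): an intermediate layer predicts 2 coarse
# clusters, the top layer predicts 4 fine clusters, over 3 frames.
intermediate_logits = [[0.0, 0.0]] * 3
intermediate_targets = [0, 1, 0]          # coarse-grained cluster ids
top_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
top_targets = [2, 0, 3]                   # fine-grained cluster ids

loss = progressive_loss(
    [intermediate_logits, top_logits],
    [intermediate_targets, top_targets],
)
# With uniform logits each layer contributes log(K), so loss = log 2 + log 4.
```

In this sketch the easier (coarse) prediction task is attached to a lower layer and the harder (fine) task to the top layer, which is one way to read "from easy to difficult"; the multi-scale self-attention structure is not modeled here.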