Improved Data2vec with Soft Supervised Hidden Unit for Mandarin Speech Recognition
Genshun Wan, Hang Chen, Pengcheng Li, Jia Pan, Zhongfu Ye
DOI: https://doi.org/10.1109/apsipaasc58517.2023.10317288
2023-01-01
Abstract: Speech pre-training methods have shown great success in learning useful, general latent representations from large-scale unlabeled data. To further improve the performance of self-supervised learning on specific downstream tasks, an improved approach based on the data2vec framework with soft supervised hidden units is proposed. To take full advantage of the labeled data from the downstream task, a supervised model is first trained to extract supervised hidden units. Then, on top of data2vec, an extra BERT-like prediction task with soft cluster distance is introduced to match the downstream task and avoid unnecessary information loss. The proposed method thus forms a virtuous cycle in the use of the downstream labeled data. Experiments on the small open-source Mandarin speech corpus AISHELL-2 and the large private Mandarin speech corpus TRANS-L show that, compared with the data2vec framework, our method achieves relative character error rate reductions of 13.2% and 5.2%, respectively, when pre-trained on the AISHELL-2 and TRANS-L corpora.
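The abstract does not spell out how the "soft cluster distance" enters the BERT-like prediction task, so the following is only a minimal sketch of one plausible reading, not the authors' formulation. It assumes that each frame's distances to K hidden-unit cluster centroids are turned into a soft label distribution with a softmax over negative distance, and that the model's predictions are scored with a soft cross-entropy on masked frames only. The function names, the `temperature` parameter, and the softmax form are all hypothetical.

```python
import math

def soft_targets(distances, temperature=1.0):
    """Map distances to K centroids onto a soft label distribution.

    Assumption (not from the paper): softmax over negative distance,
    so closer centroids receive higher probability.
    """
    scaled = [-d / temperature for d in distances]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def soft_unit_loss(frame_logits, frame_distances, masked, temperature=1.0):
    """Mean soft cross-entropy over masked frames (BERT-style masking).

    frame_logits:    per-frame model scores over the K hidden units
    frame_distances: per-frame distances to the K cluster centroids
    masked:          per-frame booleans, True where the input was masked
    """
    total, count = 0.0, 0
    for logits, dists, is_masked in zip(frame_logits, frame_distances, masked):
        if not is_masked:
            continue  # unmasked frames do not contribute to the loss
        target = soft_targets(dists, temperature)
        logp = log_softmax(logits)
        total -= sum(t * lp for t, lp in zip(target, logp))
        count += 1
    return total / count if count else 0.0
```

Compared with a hard one-hot cluster label, a soft target of this kind keeps information about how close a frame is to neighbouring units, which is one way to read the abstract's goal of avoiding unnecessary information loss.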