Abstract: Speech pre-training methods have shown great success in learning useful and general latent representations from large-scale unlabeled data. To further improve the performance of self-supervised learning on specific downstream tasks, we propose an improved approach based on the data2vec framework that uses soft supervised hidden units. To take full advantage of the labeled data from the downstream task, a supervised model is first trained to extract supervised hidden units. Then, on top of data2vec, an extra BERT-like prediction task with a soft cluster distance is introduced to match the downstream task and avoid unnecessary information loss. The proposed method thus forms a virtuous cycle in the use of the downstream labeled data. Experiments on the small open-source Mandarin speech corpus AISHELL-2 and the large private Mandarin speech corpus TRANS-L show that our method achieves relative character error rate reductions of 13.2% and 5.2%, respectively, compared with the data2vec framework when pre-trained on the AISHELL-2 and TRANS-L corpora.

Improved Data2vec with Soft Supervised Hidden Unit for Mandarin Speech Recognition