Audio DistilBERT: A Distilled Audio BERT for Speech Representation Learning.

Fan Yu, Jiawei Guo, Wei Xi, Zhao Yang, Rui Jiang, Chao Zhang
DOI: https://doi.org/10.1109/ijcnn52387.2021.9533328
2021-01-01
Abstract: Self-supervised speech representation learning is regarded as an effective way to improve the performance of downstream tasks. However, such models are often too cumbersome, which makes them hard to deploy on edge devices and raises the barrier to pre-training. In this paper, we propose Audio DistilBERT, a distilled BERT-style speech representation learning method. It learns dark knowledge from a larger teacher model through a newly designed loss that combines soft and hard targets. As a result, it achieves competitive performance with fewer parameters and faster inference. Experimental results on two downstream tasks show that the proposed method retains above 98% of the large model's performance with about 1.8× smaller model size and over 1.6× faster inference speed. In a low-resource setting with very little labeled data and few pre-training steps, our model also exhibits similar or even better performance compared to the large model. Furthermore, we explore the knowledge transfer competence between the teacher and student models.
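The abstract does not give the form of the loss, so the following is only a minimal sketch of how a loss combining soft and hard targets might look. It assumes that the soft target is the teacher's reconstructed frames, that the hard target is the ground-truth acoustic frames, that both terms use a mean absolute error, and that they are mixed by a weight `alpha`. The names `l1_loss`, `distillation_loss`, and `alpha` are hypothetical and do not come from the paper.

```python
def l1_loss(pred, target):
    """Mean absolute error over two equally shaped lists of frames."""
    total = 0.0
    count = 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += abs(p - t)
            count += 1
    return total / count


def distillation_loss(student_out, teacher_out, ground_truth, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-target term.

    Assumption (not stated in the abstract): both terms are L1 distances
    and alpha in [0, 1] sets the weight of the soft (teacher) term.
    """
    soft = l1_loss(student_out, teacher_out)    # match the teacher's output
    hard = l1_loss(student_out, ground_truth)   # match the real frames
    return alpha * soft + (1.0 - alpha) * hard


# Toy example: one frame with two feature dimensions.
student = [[1.0, 2.0]]
teacher = [[1.0, 4.0]]
truth = [[2.0, 2.0]]
loss = distillation_loss(student, teacher, truth, alpha=0.5)
```

Setting `alpha` to 1.0 in this sketch would train the student purely on the teacher's output, while 0.0 would reduce it to ordinary self-supervised reconstruction without distillation.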