End-to-end Speech Topic Classification Based on Pre-Trained Model WavLM

Tengfei Cao, Liang He, Fangjing Niu
DOI: https://doi.org/10.1109/iscslp57327.2022.10037815
2022-01-01
Abstract: Speech topic classification (STC) is the task of automatically classifying audio segments into predefined categories, and it is increasingly applied in speech indexing, retrieval, surveillance, and related fields. The typical STC system today is a pipeline consisting of automatic speech recognition (ASR), optional machine translation (MT), and text classification (TC). Although each component in the pipeline has a clear function and mature solutions, the pipeline as a whole suffers from error propagation between stages and from the scarcity of annotated training data. To address both problems, we propose a monolithic network based on pre-trained models for speech topic classification. Training the unified network end to end avoids error propagation, and the pre-trained models reduce the need for large amounts of annotated data. In addition, the proposed method can exploit the intrinsic semantic features of speech for better performance. We carried out a series of experiments on the Fisher dataset. Compared with the traditional pipeline method, our method achieves an accuracy 7 percentage points higher without requiring a large amount of annotated speech data, which shows a clear advantage.
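The abstract does not spell out the network's layers, so the following is only a minimal sketch of the general idea, under stated assumptions: frame-level features (which in the paper's setting would come from a pre-trained encoder such as WavLM) are pooled into one utterance-level vector and mapped directly to topic probabilities, with no intermediate transcript. Mean pooling, the single linear layer, the function names, and the random toy features are all illustrative assumptions, not the authors' design. The sketch is written in plain Python so that it runs without any external framework.

```python
import math
import random


def mean_pool(frames):
    """Average frame-level feature vectors into one utterance-level vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def linear(x, weights, bias):
    """Compute one logit per topic; weights has shape [num_topics][dim]."""
    return [
        sum(w_d * x_d for w_d, x_d in zip(w, x)) + b
        for w, b in zip(weights, bias)
    ]


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_topic(frames, weights, bias):
    """Hypothetical head: pool frames, score topics, return (index, probs)."""
    pooled = mean_pool(frames)
    probs = softmax(linear(pooled, weights, bias))
    topic = max(range(len(probs)), key=probs.__getitem__)
    return topic, probs


# Toy stand-in for encoder output: random features, NOT real WavLM output.
rng = random.Random(0)
NUM_FRAMES, DIM, NUM_TOPICS = 50, 8, 4  # assumed sizes for illustration
frames = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_FRAMES)]
weights = [[rng.gauss(0.0, 0.1) for _ in range(DIM)] for _ in range(NUM_TOPICS)]
bias = [0.0] * NUM_TOPICS

topic, probs = classify_topic(frames, weights, bias)
print("predicted topic index:", topic)
```

In an end-to-end setup, the classification loss would be backpropagated through both this head and the encoder, which is what removes the hand-off between separate ASR, MT, and TC stages that the abstract identifies as the source of error propagation.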