Speech Topic Classification Based on Multi-Scale and Graph Attention Networks

Fangjing Niu, Xiaozhe Qi, Xinya Chen, Liang He
DOI: https://doi.org/10.21437/interspeech.2024-1934
2024-01-01
Abstract: Speech topic classification (STC) is typically performed in two stages: automatic speech recognition (ASR) first transcribes the speech into text, and text-based topic recognition is then applied to the transcript. This cascaded approach often suffers from error propagation and a lack of global structural information. In this paper, we employ a multi-scale convolutional network that uses convolutional kernels of various sizes to capture local semantic features at different granularities along the temporal dimension. We then use an attention mechanism to learn the similarity relationships between nodes and apply a top-K mask to select the K most relevant nodes for constructing a graph network. Finally, we aggregate node features to capture the dependencies of the global context. Our method achieves state-of-the-art performance on the Fisher and Switchboard datasets, even surpassing the classification accuracy obtained on oracle transcripts.
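The three steps named in the abstract (multi-scale convolution, attention with a top-K mask, node aggregation) can be illustrated with a minimal NumPy sketch. The abstract gives no layer sizes, similarity function, or aggregation rule, so everything concrete here is an assumption made for illustration: scaled dot-product similarity, a ReLU after each convolution, a softmax over the K kept neighbors of each node, and a weighted-sum aggregation. The function names `multi_scale_conv`, `top_k_attention_graph`, and `aggregate` are hypothetical and are not taken from the paper.

```python
import numpy as np


def multi_scale_conv(x, kernels):
    """Apply several 1-D convolutions over time and concatenate the results.

    x:       (T, D) sequence of frame-level features.
    kernels: list of weight arrays of shape (k, D, C), one per kernel size k
             (odd k assumed, so zero padding keeps the output length T).
    Returns: (T, C * len(kernels)) node features.
    """
    T, _ = x.shape
    outputs = []
    for w in kernels:
        k = w.shape[0]
        pad = k // 2
        xp = np.pad(x, ((pad, pad), (0, 0)))
        # Sliding windows over time: (T, k, D).
        windows = np.stack([xp[t:t + k] for t in range(T)])
        # Convolution as a contraction over window position and input dim,
        # followed by a ReLU (assumed nonlinearity).
        outputs.append(np.maximum(np.einsum("tkd,kdc->tc", windows, w), 0.0))
    return np.concatenate(outputs, axis=1)


def top_k_attention_graph(h, k):
    """Build a sparse attention graph that keeps k neighbors per node.

    h: (T, D) node features. Returns a (T, T) row-stochastic adjacency
    matrix with exactly k nonzero entries in each row.
    """
    T, D = h.shape
    scores = h @ h.T / np.sqrt(D)  # assumed scaled dot-product similarity
    # Column indices of the k largest scores in each row.
    idx = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    # Masked entries get -inf so they receive zero weight after the softmax.
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)


def aggregate(h, adj):
    """Weighted sum of neighbor features for every node."""
    return adj @ h


# Toy usage with random inputs and weights (shapes chosen arbitrarily).
rng = np.random.default_rng(0)
frames = rng.normal(size=(20, 8))
kernels = [rng.normal(scale=0.1, size=(k, 8, 4)) for k in (1, 3, 5)]
nodes = multi_scale_conv(frames, kernels)     # (20, 12)
graph = top_k_attention_graph(nodes, k=5)     # (20, 20)
context = aggregate(nodes, graph)             # (20, 12)
utterance_vector = context.mean(axis=0)       # pooled input for a classifier
```

In this sketch each frame acts as a graph node, and the top-K mask makes the attention matrix sparse, so each node aggregates only from its K most similar nodes rather than from the whole sequence. Mean pooling at the end is likewise an assumption; the abstract does not state how node features are combined into an utterance-level prediction.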