Self-Convolution for Automatic Speech Recognition.

Tianhao Zhang, Qi Liu, Xinyuan Qian, Song-Lu Chen, Feng Chen, Xu-Cheng Yin
DOI: https://doi.org/10.1109/icassp49357.2023.10095330
2023-01-01
ICASSP
Abstract: Self-attention plays a significant role in recent automatic speech recognition (ASR) models and has delivered promising results. However, it suffers from high computational complexity and a weak capability for modeling local information. In contrast, the convolutional neural network (CNN) is computationally efficient and superior at learning local information, but it lacks self-interaction and fails to capture long-range dependencies among input tokens. Accordingly, we combine their complementary advantages and propose a new module, namely self-convolution, to compensate for the limitations of each. Specifically, self-convolution generates convolution kernels at each token (to model local information), which are then used to convolve the token itself (for self-interaction). Moreover, we bring in global information during the generation of the convolution kernels to enhance the learning of long-range dependencies. In this way, the advantages of both self-attention and CNN are utilized. We conduct rigorous experiments on the LibriSpeech, Tedlium2, and AIShell1 datasets and demonstrate that our proposed self-convolution achieves better ASR performance than self-attention at a lower computational cost.
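The mechanism described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not specify how kernels are generated or how global information is injected, so the linear projections `w_local` and `w_global`, the mean-pooled context, the softmax normalization, and the zero padding are all assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_convolution(tokens, w_local, w_global):
    """Illustrative self-convolution over a sequence.

    tokens:   list of T feature vectors, each of size D.
    w_local:  D x K matrix (hypothetical) mapping a token to K kernel logits.
    w_global: D x K matrix (hypothetical) mapping the global context
              to K kernel logits. K is assumed to be odd.
    """
    T, D = len(tokens), len(tokens[0])
    K = len(w_local[0])
    half = K // 2

    # Global information: mean over time (an assumed choice).
    context = [sum(tok[d] for tok in tokens) / T for d in range(D)]
    global_logits = [
        sum(context[d] * w_global[d][k] for d in range(D)) for k in range(K)
    ]

    output = []
    for t in range(T):
        # Kernel generated from the token itself plus the global context.
        logits = [
            sum(tokens[t][d] * w_local[d][k] for d in range(D)) + global_logits[k]
            for k in range(K)
        ]
        kernel = softmax(logits)

        # Convolve the local window around position t with that kernel,
        # treating positions outside the sequence as zeros.
        out = [0.0] * D
        for k in range(K):
            s = t + k - half
            if 0 <= s < T:
                for d in range(D):
                    out[d] += kernel[k] * tokens[s][d]
        output.append(out)
    return output
```

In this sketch each output position touches only K neighbouring tokens, so the work per position depends on the kernel width rather than on the sequence length, while the kernel weights still depend on the token and on a sequence-level summary.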