A Window Attention Based Transformer for Automatic Speech Recognition

Zhao Feng, Yongming Li
DOI: https://doi.org/10.1109/iccea62105.2024.10603610
2024-01-01
Abstract: Many automatic speech recognition models, including Transformer and Conformer, rely on self-attention. The Transformer's strong recognition results stem from its global attention mechanism. However, global self-attention alone is computationally expensive and models local information poorly. Inspired by the Swin-Transformer in computer vision, which reduces computational complexity by dividing images into sub-images, this paper introduces a window self-attention mechanism tailored to speech data. The mechanism aims to reduce computational complexity while strengthening local modeling. Because relying solely on window self-attention may compromise global modeling and degrade performance, we introduce a shifted window mechanism that enables soft global modeling and mitigates the isolation caused by segmenting the input into windows. The proposed Slide-Transformer model alternates between window self-attention modules and shifted window self-attention modules. We conducted experiments on the Common Voice Portuguese, Common Voice Dutch, and Aishell1 Chinese datasets. The results show that the Slide-Transformer achieves a lower character error rate than the baseline Transformer, validating its effectiveness in improving speech recognition accuracy.
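
The windowing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no window size, shift amount, or boundary handling, so the cyclic shift (borrowed from the Swin-Transformer), the toy sizes, and the function names `window_ids`, `attention_mask`, and `score_count` are all assumptions made for illustration. The sketch only builds boolean masks showing which frame pairs may attend to each other, and counts the attention scores each layout would compute.

```python
def window_ids(num_frames, window_size, shift=0):
    """Assign each frame index to a window id.

    Assumption: the shifted layout is a cyclic shift of the window grid by
    `shift` frames, as in the Swin-Transformer; the paper may differ.
    """
    if num_frames % window_size != 0:
        raise ValueError("num_frames must be a multiple of window_size (pad first)")
    return [((t - shift) % num_frames) // window_size for t in range(num_frames)]


def attention_mask(num_frames, window_size, shift=0):
    """mask[q][k] is True when query frame q may attend to key frame k."""
    ids = window_ids(num_frames, window_size, shift)
    return [[ids[q] == ids[k] for k in range(num_frames)] for q in range(num_frames)]


def score_count(mask):
    """Number of query-key scores that one attention layer would compute."""
    return sum(row.count(True) for row in mask)


# Toy sizes (hypothetical): 8 frames, windows of 4 frames, shift of 2 frames.
full = attention_mask(8, 8)              # one window spanning everything = global attention
regular = attention_mask(8, 4)           # windows {0..3} and {4..7}
shifted = attention_mask(8, 4, shift=2)  # windows {2..5} and {6, 7, 0, 1}

# Frames 3 and 4 sit on opposite sides of a regular window boundary,
# but share a window once the grid is shifted.
print(regular[3][4], shifted[3][4])
print(score_count(full), score_count(regular))
```

With these toy sizes, each windowed layer computes 32 scores against 64 for global attention, and the shifted layer connects frames 3 and 4, which the regular layout keeps apart. Alternating the two layouts is what lets information cross window boundaries. Note that the cyclic shift also groups the last and first frames into one window; a practical implementation would likely mask those wrap-around pairs, which this sketch omits.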