End-to-End Speech Recognition Model Based on Dilated Sparse Aware Network

Na Liu, Liejun Wang, Yongming Li, Yinfeng Yu
DOI: https://doi.org/10.1109/ainit61980.2024.10581771
2024-01-01
Abstract: End-to-end automatic speech recognition methods are now widely used, and Conformer is among the most popular because it captures rich representations without significantly increasing computational complexity, thereby improving recognition accuracy. Self-attention is crucial for capturing global information, but because its receptive field spans the entire sequence, models that rely on it incur additional computational cost. To address this, this paper proposes an enhanced architecture called the Dilated Sparse Aware Network (DSAN), built on a hybrid architecture combining Connectionist Temporal Classification (CTC) and self-attention, which aims for a better balance between computational efficiency and receptive field size. Specifically, the paper introduces a spatial sparse attention mechanism that identifies the query dot-product pairs that dominate attention, uses them to construct a sparse query matrix, and thereby reduces attention complexity. In addition, a dilated sliding window attention module lets the sparse queries interact with local features inside the sliding window, making effective use of local information and widening the receptive field without increasing computational demands. As a result, the proposed approach incorporates contextual semantic information comprehensively.
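The two ideas in the abstract (keeping only the queries that dominate attention, and letting them attend over a dilated sliding window) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not say how dominant queries are scored or what happens to the remaining positions, so the max-minus-mean dominance measure, the value pass-through for inactive queries, and every name below (`dilated_window`, `sparse_dilated_attention`, `n_active`, `half_width`, `dilation`) are assumptions made for illustration.

```python
import math


def dilated_window(i, length, half_width, dilation):
    """Key positions i + j*dilation for j in [-half_width, half_width], kept inside the sequence."""
    return [i + j * dilation
            for j in range(-half_width, half_width + 1)
            if 0 <= i + j * dilation < length]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_dilated_attention(Q, K, V, n_active, half_width, dilation):
    """Q, K, V are lists of T vectors. Returns (outputs, indices of active queries)."""
    T, d = len(Q), len(Q[0])
    scale = 1.0 / math.sqrt(d)
    windows = [dilated_window(i, T, half_width, dilation) for i in range(T)]
    # Scaled dot products of each query with the keys in its dilated window only.
    scores = [[dot(Q[i], K[p]) * scale for p in windows[i]] for i in range(T)]
    # ASSUMPTION: a query "dominates" when its score distribution is peaked,
    # measured here as max minus mean. The abstract does not define the measure.
    dominance = [max(s) - sum(s) / len(s) for s in scores]
    active = set(sorted(range(T), key=lambda i: dominance[i], reverse=True)[:n_active])
    outputs = []
    for i in range(T):
        if i in active:
            weights = softmax(scores[i])
            outputs.append([sum(w * V[p][c] for w, p in zip(weights, windows[i]))
                            for c in range(len(V[0]))])
        else:
            # ASSUMPTION: inactive queries simply pass their own value through.
            outputs.append(list(V[i]))
    return outputs, sorted(active)
```

With `half_width = w` and `dilation = r`, each active query scores at most `2w + 1` keys while its window spans `2wr + 1` positions, which is the sense in which dilation widens the receptive field without adding dot products. For example, `dilated_window(5, 10, 2, 2)` covers positions 1 through 9 using five keys.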