SPformer: Hybrid Sequential-Parallel Architectures for Automatic Speech Recognition

Jiaqi Chen, Mingdong Yu, Bo Wang, Xiaofeng Jin, Guirong Wang
DOI: https://doi.org/10.1109/ICME57554.2024.10687992
2024-07-15
Abstract: In recent years, the ability to model interactions across multi-scale information has come to be regarded as a crucial property of the Automatic Speech Recognition (ASR) encoder. Conformer and Branchformer, which represent sequential and parallel architectural designs, respectively, both facilitate interaction between global and local information and achieve state-of-the-art performance. However, sequential architectures suffer from poor interpretability of the interaction process and a rigid model design, while parallel architectures suffer from integration difficulties and limited interaction. To address these issues, we propose the SPformer, which combines sequential connections with parallel branch architectures. It allows dynamic interaction between convolution and self-attention while retaining a branch structure. Experiments on public ASR datasets show that the SPformer outperforms both Conformer and E-Branchformer in in-domain and out-of-domain settings.
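The contrast between sequential and parallel designs can be illustrated with a toy sketch. It is not the SPformer or any published layer definition: `local_mix` (a 3-frame moving average) and `global_mix` (a sequence mean) are hypothetical stand-ins for convolution and self-attention, and `hybrid_block` is only one assumed way to add a sequential connection between parallel branches.

```python
from typing import List

Seq = List[float]  # toy feature sequence: one scalar per frame


def local_mix(x: Seq) -> Seq:
    """Hypothetical stand-in for convolution: 3-frame moving average, edges clamped."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0 for i in range(n)]


def global_mix(x: Seq) -> Seq:
    """Hypothetical stand-in for self-attention: every frame sees the sequence mean."""
    mean = sum(x) / len(x)
    return [mean for _ in x]


def add(a: Seq, b: Seq) -> Seq:
    """Element-wise sum, used for residual connections."""
    return [u + v for u, v in zip(a, b)]


def average(a: Seq, b: Seq) -> Seq:
    """Simplest possible branch merge: element-wise mean."""
    return [(u + v) / 2.0 for u, v in zip(a, b)]


def sequential_block(x: Seq) -> Seq:
    """Sequential ordering: the local module consumes the global module's output."""
    h = add(x, global_mix(x))
    return add(h, local_mix(h))


def parallel_block(x: Seq) -> Seq:
    """Parallel ordering: both branches read the same input and are merged afterwards."""
    return add(x, average(global_mix(x), local_mix(x)))


def hybrid_block(x: Seq) -> Seq:
    """Assumed hybrid (illustration only): branches are kept, but the local
    branch also receives the global branch's output before the merge."""
    g = global_mix(x)
    loc = local_mix(add(x, g))
    return add(x, average(g, loc))


if __name__ == "__main__":
    frames = [0.0, 3.0, 0.0]
    print(sequential_block(frames))
    print(parallel_block(frames))
    print(hybrid_block(frames))
```

In the sequential block the order of the two modules is fixed, whereas in the parallel block the two branches never see each other's output before merging; the assumed hybrid sits between the two.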
Engineering, Computer Science