Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer

Tomoki Honda, Shinsuke Sakai, Tatsuya Kawahara
DOI: https://doi.org/10.21437/Interspeech.2024-258
2024-10-05
Abstract: Recently, Conformer has achieved state-of-the-art performance in many speech recognition tasks. However, the Transformer-based models show significant deterioration for long-form speech, such as lectures, because the self-attention mechanism becomes unreliable with the computation of the square order of the input length. To solve the problem, we incorporate a kind of state-space model, Hungry Hungry Hippos (H3), to replace or complement the multi-head self-attention (MHSA). H3 allows for efficient modeling of long-form sequences with a linear-order computation. In experiments using two datasets of CSJ and LibriSpeech, our proposed H3-Conformer model performs efficient and robust recognition of long-form speech. Moreover, we propose a hybrid of H3 and MHSA and show that using H3 in higher layers and MHSA in lower layers provides significant improvement in online recognition. We also investigate a parallel use of H3 and MHSA in all layers, resulting in the best performance.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses efficient and robust recognition of long-form speech, such as lectures and meetings. Transformer-based models such as Conformer degrade significantly on long-form speech because the computational cost of multi-head self-attention (MHSA) grows with the square of the input length, which makes it both expensive and unreliable on long inputs.

### Core of the problem

1. **Challenges in long-form speech recognition**: On long-form speech (lectures, meetings), the performance of conventional models drops significantly, especially in online recognition.
2. **Limitations of the self-attention mechanism**: The computational complexity of MHSA is \(O(L^2)\), where \(L\) is the length of the input sequence, so processing long sequences is expensive and unstable.

### Solutions

To address these problems, the authors incorporate a state-space model (SSM), Hungry Hungry Hippos (H3), to replace or complement MHSA. H3 has linear computational complexity, \(O(L)\), so it processes long sequences more efficiently, and it performed better in the reported experiments.

### Specific improvements

1. **H3-Conformer model**: The MHSA layer in Conformer is replaced directly with an H3 layer.
2. **Hybrid H3-Conformer (CH4) model**: A hybrid model that uses either the H3 layer or the MHSA layer depending on the encoder layer. According to the abstract, using H3 in higher layers and MHSA in lower layers gives a significant improvement in online recognition.
3. **Parallel CH4 model**: The H3 layer and the MHSA layer are used in parallel in every layer, combining the advantages of both. The abstract reports that this configuration gives the best performance.

### Experimental results

Experiments on two datasets, CSJ and LibriSpeech, support the proposed models for long-form speech recognition. In online recognition in particular, the CH4 and H3-Conformer models performed significantly better than the conventional Conformer model.

### Summary

By introducing H3, the paper addresses both the computational cost and the accuracy degradation of long-form speech recognition, and offers an efficient and robust alternative to attention-only encoders.
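
The complexity contrast and the three layer arrangements summarized above can be illustrated with a small sketch. This is not the authors' implementation: `ssm_scan` is a toy scalar linear recurrence standing in for a state-space layer (it is not the H3 layer itself), and the function names, the scheme labels, and the `num_mhsa_lower` split point are all hypothetical choices made for illustration. The summary does not say at which layer the hybrid model switches from MHSA to H3.

```python
"""Toy illustration of O(L^2) attention cost vs. an O(L) recurrent scan.

Everything here is an assumption made for illustration; it is not the
H3 layer or the encoder configuration used in the paper.
"""


def attention_score_count(length: int) -> int:
    """Query-key pairs scored by full self-attention: L * L."""
    return length * length


def ssm_update_count(length: int) -> int:
    """State updates performed by a recurrent scan: one per frame."""
    return length


def ssm_scan(inputs, decay=0.9, in_gain=1.0, out_gain=1.0):
    """Scalar linear recurrence: x_t = decay * x_{t-1} + in_gain * u_t,
    y_t = out_gain * x_t. The loop visits each frame once, so the cost
    grows linearly with the sequence length."""
    state = 0.0
    outputs = []
    for u in inputs:
        state = decay * state + in_gain * u
        outputs.append(out_gain * state)
    return outputs


def layer_plan(num_layers, scheme, num_mhsa_lower=0):
    """Return, per encoder layer (lowest first), which sequence-mixing
    modules are used. `num_mhsa_lower` is a hypothetical split point."""
    if scheme == "conformer":   # MHSA in every layer
        return [("mhsa",)] * num_layers
    if scheme == "h3":          # H3 replaces MHSA in every layer
        return [("h3",)] * num_layers
    if scheme == "hybrid":      # MHSA in lower layers, H3 in higher layers
        if not 0 <= num_mhsa_lower <= num_layers:
            raise ValueError("num_mhsa_lower must be within [0, num_layers]")
        return ([("mhsa",)] * num_mhsa_lower
                + [("h3",)] * (num_layers - num_mhsa_lower))
    if scheme == "parallel":    # H3 and MHSA side by side in every layer
        return [("h3", "mhsa")] * num_layers
    raise ValueError(f"unknown scheme: {scheme!r}")


if __name__ == "__main__":
    # Compare how the two operation counts grow with sequence length.
    for length in (100, 1_000, 10_000):
        print(length, attention_score_count(length), ssm_update_count(length))
    print(layer_plan(4, "hybrid", num_mhsa_lower=2))
```

Multiplying the sequence length by ten multiplies the attention count by one hundred but the scan count only by ten, which is the gap that motivates replacing or complementing MHSA on long inputs.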