Abstract: Recently, Conformer has achieved state-of-the-art performance in many speech recognition tasks. However, the Transformer-based models show significant deterioration for long-form speech, such as lectures, because the self-attention mechanism becomes unreliable with the computation of the square order of the input length. To solve the problem, we incorporate a kind of state-space model, Hungry Hungry Hippos (H3), to replace or complement the multi-head self-attention (MHSA). H3 allows for efficient modeling of long-form sequences with a linear-order computation. In experiments using two datasets of CSJ and LibriSpeech, our proposed H3-Conformer model performs efficient and robust recognition of long-form speech. Moreover, we propose a hybrid of H3 and MHSA and show that using H3 in higher layers and MHSA in lower layers provides significant improvement in online recognition. We also investigate a parallel use of H3 and MHSA in all layers, resulting in the best performance.

What problem does this paper attempt to address?

The problem this paper addresses is efficient and robust recognition of long-form speech (such as lectures and meetings). Specifically, the performance of Transformer-based models (such as Conformer) drops significantly on long-form speech, because the computational complexity of multi-head self-attention (MHSA) grows with the square of the input length, which makes the computation expensive and the attention unreliable.

### Core of the problem

1. **Challenges in long-form speech recognition**: For long-form speech (such as lectures and meetings), the performance of conventional models drops significantly, especially in online recognition scenarios.
2. **Limitations of the self-attention mechanism**: The computational complexity of MHSA is \(O(L^2)\), where \(L\) is the length of the input sequence, which makes processing long sequences expensive and unstable.

### Solutions

To solve these problems, the authors introduce a state-space model (SSM) called Hungry Hungry Hippos (H3) to replace or supplement MHSA. H3 has linear-order computational complexity, \(O(L)\), so it processes long sequences more efficiently, and in the reported experiments it recognizes long-form speech more robustly.

### Specific improvements

1. **H3-Conformer model**: Directly replace the MHSA layer in Conformer with the H3 layer to form the H3-Conformer model.
2. **Hybrid H3-Conformer (CH4) model**: A hybrid model that uses either the H3 layer or the MHSA layer in each encoder layer. According to the abstract, using H3 in higher layers and MHSA in lower layers gives a significant improvement in online recognition.
3. **Parallel CH4 model**: Use the H3 layer and the MHSA layer in parallel in every layer, combining the advantages of both. The abstract reports that this configuration gives the best performance.
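The three layouts listed above can be sketched as a per-layer plan. This is only an illustrative sketch, not the authors' code: the function name `mixer_plan`, the variant labels, and the layer counts in the usage note are hypothetical, and the paper's actual split between MHSA and H3 layers is not stated in this summary.

```python
from typing import List


def mixer_plan(num_layers: int, variant: str, num_mhsa_layers: int = 0) -> List[str]:
    """Return which sequence mixer each encoder layer uses, lowest layer first.

    All names here are hypothetical labels chosen for illustration.
    """
    if variant == "conformer":
        # Baseline: MHSA in every layer.
        return ["mhsa"] * num_layers
    if variant == "h3":
        # H3-Conformer: H3 replaces MHSA in every layer.
        return ["h3"] * num_layers
    if variant == "hybrid":
        # Hybrid: MHSA in the lower layers, H3 in the higher layers,
        # following the arrangement described in the abstract.
        if not 0 <= num_mhsa_layers <= num_layers:
            raise ValueError("num_mhsa_layers must be between 0 and num_layers")
        return ["mhsa"] * num_mhsa_layers + ["h3"] * (num_layers - num_mhsa_layers)
    if variant == "parallel":
        # Parallel: both mixers run side by side in every layer.
        return ["h3+mhsa"] * num_layers
    raise ValueError(f"unknown variant: {variant}")
```

For example, `mixer_plan(4, "hybrid", num_mhsa_layers=2)` places MHSA in the two lowest layers and H3 in the two highest; the numbers 4 and 2 are arbitrary.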

### Experimental results

Experiments on two datasets, CSJ and LibriSpeech, verified the superiority of the proposed models in long-form speech recognition, especially in online recognition scenarios, where the CH4 and H3-Conformer models performed significantly better than the conventional Conformer model.

### Summary

By introducing the H3 model, this paper addresses the computational complexity and performance problems of long-form speech recognition, providing a new approach to efficient and robust recognition of long-form speech.
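As a rough illustration of the \(O(L^2)\) versus \(O(L)\) argument in the summary, the sketch below counts the pairwise scores that full self-attention computes and runs a linear recurrence that touches each input once. It is a generic scalar state-space recurrence, not the H3 layer from the paper; the function names and the parameters `a`, `b`, `c` are assumptions made for illustration.

```python
from typing import List, Sequence


def attention_score_count(seq_len: int) -> int:
    """Pairwise query-key scores that full self-attention computes per head."""
    return seq_len * seq_len


def ssm_scan(inputs: Sequence[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Scalar state-space recurrence: x_t = a * x_{t-1} + b * u_t, y_t = c * x_t.

    One constant-cost update per input frame, so the work grows linearly
    with the sequence length.
    """
    state = 0.0
    outputs = []
    for u in inputs:
        state = a * state + b * u
        outputs.append(c * state)
    return outputs
```

Doubling the input length doubles the number of recurrence steps but quadruples the number of attention scores, which is the gap the summary refers to.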

Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer

HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition

Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

Multi-Convformer: Extending Conformer with Multiple Convolution Kernels

MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification

Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping

MEConformer: Highly representative embedding extractor for speaker verification via incorporating selective convolution into deep speaker encoder

Nextformer: A ConvNeXt Augmented Conformer For End-To-End Speech Recognition

Efficient conformer-based speech recognition with linear attention

FusionFormer: Fusing Operations in Transformer for Efficient Streaming Speech Recognition

Conformer with dual-mode chunked attention for joint online and offline ASR

SPformer: Hybrid Sequential-Parallel Architectures for Automatic Speech Recognition

Efficient End-to-End Speech Recognition Using Performers in Conformers

Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems

Continual Learning for On-Device Speech Recognition using Disentangled Conformers

Sla-former: conformer using shifted linear attention for audio-visual speech recognition

Conformers are All You Need for Visual Speech Recognition

Towards A Unified Conformer Structure: from ASR to ASV Task

Skipformer: A Skip-and-Recover Strategy for Efficient Speech Recognition

Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR