Block-State Transformers

Mahan Fathi, Jonathan Pilault, Orhan Firat, Christopher Pal, Pierre-Luc Bacon, Ross Goroshin
2023-10-30
Abstract: State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies and efficiently scale to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have shown superior performance on a plethora of tasks, in vision and audio; however, SSMs still lag Transformer performance in Language Modeling tasks. In this work, we propose a hybrid layer named Block-State Transformer (BST), that internally combines an SSM sublayer for long-range contextualization, and a Block Transformer sublayer for short-term representation of sequences. We study three different, and completely parallelizable, variants that integrate SSMs and block-wise attention. We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences. In addition, the Block-State Transformer demonstrates more than tenfold increase in speed at the layer level compared to the Block-Recurrent Transformer when model parallelization is employed.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper aims to improve the performance and efficiency of language models on long input sequences. Specifically:

1. **Long-range dependency modeling**: Transformer models struggle with long sequences, especially in capturing long-distance dependencies. Self-attention lets the Transformer handle relatively long sequences, but its time complexity is \(O(L^2)\), where \(L\) is the sequence length, which makes training and inference very expensive. In addition, the Transformer may become unstable when processing extremely long sequences, and attention concentrates on about 50 tokens around the current time step, so long-distance information is lost.
2. **Computational efficiency**: Although the Transformer parallelizes well, its computational cost remains high on long sequences. In contrast, state-space models (SSMs) are more efficient on long sequences, with a time complexity of \(O(L \log L)\), and can use the fast Fourier transform (FFT) for parallel computation. However, SSMs have not yet fully matched the Transformer on general-purpose language modeling tasks.
3. **Combining advantages**: The paper proposes a new architecture, the Block-State Transformer (BST), that aims to combine the local attention mechanism of the Transformer with the long-range context modeling ability of SSMs. In this way, BST can handle longer input sequences while remaining computationally efficient.

### Specific goals:
- **Improve language modeling performance**: By combining the advantages of SSMs and the Transformer, BST achieves better perplexity than existing Transformer models on language modeling tasks.
- **Extend to longer sequences**: BST can handle longer sequences without significantly increasing the computational cost.
- **Accelerate training and inference**: BST is more than 10 times faster than the Block-Recurrent Transformer at the layer level when model parallelization is used.

### Method overview:
- **SSM sub-layer**: Provides long-range context information and computes its convolutions efficiently through the FFT.
- **Block Transformer sub-layer**: Handles short-range representations and captures local information through block-wise attention.
- **Three variants**: The paper studies three parallelizable variants (Single-Head, Multi-Head, Multi-Filter) to explore how to most effectively combine SSM state information with the attention mechanism.

### Experimental results:
- **Data sets**: The experiments were carried out on three data sets, PG19, arXiv, and GitHub, covering different types of text (books, scientific articles, source code).
- **Performance improvement**: Under the same computational budget, BST achieved a 1.5% to 4% improvement over baseline models (such as BRECT and GSS-HYBRID-L) on multiple data sets.
- **Computational efficiency**: At the layer level, BST is significantly faster than the other models, especially on long sequences.

In conclusion, the Block-State Transformer architecture addresses the performance and efficiency problems of existing Transformer models on long sequences and offers a new approach to long-range language modeling.
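
The FFT-based convolution mentioned in the method overview can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names are hypothetical, the input is a toy 1-D sequence, and the exponentially decaying kernel `k` is an assumption standing in for a learned SSM kernel. The sketch only shows why a causal convolution that costs \(O(L^2)\) when computed directly can be computed in \(O(L \log L)\) with the FFT.

```python
import numpy as np

def causal_conv_direct(u, k):
    """O(L^2) reference: y[t] = sum over s <= t of k[s] * u[t - s]."""
    L = len(u)
    y = np.zeros(L)
    for t in range(L):
        for s in range(t + 1):
            y[t] += k[s] * u[t - s]
    return y

def causal_conv_fft(u, k):
    """O(L log L) version of the same causal convolution, via the FFT."""
    L = len(u)
    n = 2 * L  # zero-pad so the circular convolution does not wrap around
    U = np.fft.rfft(u, n=n)
    K = np.fft.rfft(k, n=n)
    # Pointwise product in the frequency domain equals convolution in time.
    return np.fft.irfft(U * K, n=n)[:L]

rng = np.random.default_rng(0)
L = 256
u = rng.standard_normal(L)   # toy input sequence
k = 0.9 ** np.arange(L)      # hypothetical decaying kernel (assumption)

y_direct = causal_conv_direct(u, k)
y_fft = causal_conv_fft(u, k)
print(np.allclose(y_direct, y_fft))
```

Both functions return the first `L` outputs of the same linear convolution, so the output at position `t` depends only on inputs at positions up to `t`. Padding to length `2 * L` is what keeps the FFT's circular convolution from mixing late inputs into early outputs.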