Vision-LSTM: xLSTM as Generic Vision Backbone

Benedikt Alkin, Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, Johannes Brandstetter
2024-07-02
Abstract: Transformers are widely used as generic backbones in computer vision, despite being initially introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to a scalable and performant architecture - the xLSTM - which overcomes long-standing LSTM limitations via exponential gating and a parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaptation of the xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks where odd blocks process the sequence of patch tokens from top to bottom while even blocks go from bottom to top. Experiments show that ViL holds promise to be further deployed as a new generic backbone for computer vision architectures.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to apply the extended Long Short-Term Memory (xLSTM) architecture to computer vision, and proposes a new vision backbone, Vision-LSTM (ViL). Its main objectives are:

1. **Introducing ViL as a generic vision backbone**: adapting xLSTM blocks to computer vision in order to build an efficient vision model.
2. **Overcoming the limitations of traditional LSTMs on vision tasks**: traditional LSTM models have shortcomings when processing image data. ViL handles non-sequential inputs such as images by alternating the direction in which successive mLSTM blocks traverse the sequence of patch tokens.
3. **Demonstrating ViL's performance on multiple vision tasks**: validating ViL experimentally on tasks such as image classification and semantic segmentation, and comparing it with existing methods.

Overall, the study aims to show the potential of xLSTM in computer vision and to propose ViL as a new, efficient vision model.
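The alternating traversal described in the abstract (odd blocks read the patch tokens from top to bottom, even blocks from bottom to top) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `flatten_patches`, `causal_running_mean`, and `alternating_stack` are hypothetical, each token is a single float rather than an embedding vector, and a causal running mean stands in for the actual mLSTM layer purely so that the direction of information flow is easy to see.

```python
from typing import Callable, List, Sequence

Layer = Callable[[List[float]], List[float]]


def flatten_patches(grid: Sequence[Sequence[float]]) -> List[float]:
    """Flatten a 2D grid of patch tokens row by row (top-left to bottom-right)."""
    return [token for row in grid for token in row]


def causal_running_mean(tokens: List[float]) -> List[float]:
    """Placeholder for a causal sequence layer (NOT an mLSTM).

    Output position i depends only on input positions 0..i.
    """
    out: List[float] = []
    total = 0.0
    for count, token in enumerate(tokens, start=1):
        total += token
        out.append(total / count)
    return out


def alternating_stack(
    tokens: Sequence[float],
    num_blocks: int,
    layer: Layer = causal_running_mean,
) -> List[float]:
    """Apply `num_blocks` causal layers, alternating the traversal direction.

    Blocks are counted from 1: odd blocks read first-to-last token,
    even blocks read last-to-first and restore the original order afterwards.
    """
    x = list(tokens)
    for block_idx in range(1, num_blocks + 1):
        if block_idx % 2 == 1:
            x = layer(x)              # odd block: top to bottom
        else:
            x = layer(x[::-1])[::-1]  # even block: bottom to top
    return x


if __name__ == "__main__":
    tokens = flatten_patches([[1.0, 2.0], [3.0, 4.0]])
    print(alternating_stack(tokens, num_blocks=1))
    print(alternating_stack(tokens, num_blocks=2))
```

With a single block, the first token can only see itself, so changing the last patch leaves its output untouched. After the second, reversed block, every position has received information from the whole sequence, which is the motivation for alternating directions when the input has no natural one-dimensional order.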