Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling

Georgios Pantazopoulos, Malvina Nikandrou, Alessandro Suglia, Oliver Lemon, Arash Eshghi
2024-10-01
Abstract: This study explores replacing Transformers in Visual Language Models (VLMs) with Mamba, a recent structured state space model (SSM) that demonstrates promising performance in sequence modeling. We test models up to 3B parameters under controlled conditions, showing that Mamba-based VLMs outperform Transformer-based VLMs in captioning, question answering, and reading comprehension. However, we find that Transformers achieve greater performance in visual grounding, and the performance gap widens with scale. We explore two hypotheses to explain this phenomenon: 1) the effect of task-agnostic visual encoding on the updates of the hidden states, and 2) the difficulty in performing visual grounding from the perspective of in-context multimodal retrieval. Our results indicate that a task-aware encoding yields minimal performance gains on grounding; however, Transformers significantly outperform Mamba at in-context multimodal retrieval. Overall, Mamba shows promising performance on tasks where the correct output relies on a summary of the image but struggles when retrieval of explicit information from the context is required.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper investigates what happens when the Transformer backbone of a Vision-and-Language Model (VLM) is replaced with Mamba, a structured state space model (SSM). Specifically, the research aims to answer the following key questions:

1. **Performance Comparison**: Can Mamba match or even outperform Transformer-based VLMs on vision-and-language tasks such as image captioning, visual question answering, and reading comprehension?
2. **Task Differences**: How do Mamba and the Transformer differ across types of multimodal tasks? In particular, how large is the gap on tasks that require fine-grained information, such as visual grounding?
3. **Cause Analysis**: If there are performance differences, what explains them? In particular, why does Mamba perform well on some tasks but fall behind the Transformer on others?

To answer these questions, the authors carried out the following work:

- Trained and evaluated Mamba-VL and Pythia-VL (a Transformer-based VLM) at several model scales, from 790M to 2.8B parameters.
- Tested the models on a series of multimodal tasks, including coarse-grained tasks (such as image captioning and visual question answering) and fine-grained tasks (such as visual grounding).
- Explored the impact of task-agnostic visual encoding and of in-context multimodal retrieval on model performance.

Through these experiments, the authors found that:

- Mamba-VL performs strongly on image captioning, visual question answering, and reading comprehension, and outperforms Pythia-VL of the same scale on these tasks.
- However, on tasks that require precise retrieval of information from the context (such as visual grounding), Pythia-VL outperforms Mamba-VL, and this gap widens as model scale increases.
- The authors proposed two hypotheses to explain this gap: the effect of task-agnostic visual encoding on hidden-state updates, and the difficulty of in-context multimodal retrieval. Their experiments show that a task-aware encoding yields only minimal gains on grounding, whereas Transformers significantly outperform Mamba at in-context multimodal retrieval.

In conclusion, through controlled experiments and analysis, this paper shows where Mamba is competitive in multimodal modeling (tasks that rely on a summary of the image) and where it is limited (tasks that require retrieving explicit information from the context).
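
The contrast between summarizing the context and retrieving from it can be illustrated with a toy sketch. The code below is not the paper's method and is not Mamba's selective scan: it is a minimal, assumed example that compares a plain diagonal linear recurrence (a fixed-size state that blends everything it has seen) with single-query softmax attention (which can look up one specific item). The function names `linear_recurrence` and `attention_lookup`, the one-hot keys, and the numeric values are all hypothetical choices made for illustration.

```python
import math


def linear_recurrence(xs, decay=0.9):
    """Toy diagonal linear recurrence: h_t = decay * h_{t-1} + (1 - decay) * x_t.

    The state h has the same size no matter how long the sequence is,
    so it can only hold a blended summary of the inputs.
    """
    h = [0.0] * len(xs[0])
    for x in xs:
        h = [decay * hi + (1.0 - decay) * xi for hi, xi in zip(h, x)]
    return h


def attention_lookup(query, keys, values, scale=10.0):
    """Softmax attention for one query over every stored key/value pair."""
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Hypothetical context: four one-hot keys, each paired with a scalar value.
keys = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
values = [[10.0], [20.0], [30.0], [40.0]]

# Attention can address the second item directly: the result is close to 20.
retrieved = attention_lookup(keys[1], keys, values)

# The recurrence returns one blended number, weighted toward recent inputs.
summary = linear_recurrence(values)
```

In this sketch the attention output stays close to the value paired with the queried key, while the recurrent state is a recency-weighted average from which no single item can be read back. Real SSMs such as Mamba use input-dependent (selective) updates and a much larger state, so this only conveys the intuition behind the summary-versus-retrieval distinction, not the measured gap.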