Mamba Fusion: Learning Actions Through Questioning

Zhikang Dong, Apoorva Beedu, Jason Sheinkopf, Irfan Essa
2024-09-18
Abstract: Video Language Models (VLMs) are crucial for generalizing across diverse tasks and using language cues to enhance learning. While transformer-based architectures have been the de facto standard in vision-language training, they face challenges like quadratic computational complexity, high GPU memory usage, and difficulty with long-term dependencies. To address these limitations, we introduce MambaVL, a novel model that leverages recent advancements in selective state space modality fusion to efficiently capture long-range dependencies and learn joint representations for vision and language data. MambaVL utilizes a shared state transition matrix across both modalities, allowing the model to capture information about actions from multiple perspectives within the scene. Furthermore, we propose a question-answering task that helps guide the model toward relevant cues. These questions provide critical information about actions, objects, and environmental context, leading to enhanced performance. As a result, MambaVL achieves state-of-the-art performance in action recognition on the Epic-Kitchens-100 dataset and outperforms baseline methods in action anticipation.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper targets three weaknesses of existing video language models (VLMs) when they process long sequences and fuse multiple modalities: high computational complexity, high GPU memory usage, and difficulty capturing long-term dependencies. Specifically:

1. **Processing of long-sequence dependencies**: Because attention has quadratic computational complexity in sequence length, traditional Transformer-based architectures are inefficient in both training and inference on video data, and they struggle to capture dependencies that span long time ranges.
2. **Effectiveness of multimodal fusion**: Existing VLMs usually rely on descriptions or captions to relate visual and linguistic information, which limits how deeply the model understands the action recognition task. These models also lack effective mechanisms for integrating information from different modalities, a limitation that is especially pronounced in egocentric videos.

To address these problems, the authors propose the MambaVL model, which has the following characteristics:

- **Selective state-space modality fusion**: By introducing a shared state-transition matrix, MambaVL can efficiently transfer information between the visual and language modalities, which helps it capture long-range dependencies and learn joint representations.
- **Question-answering-guided learning**: To strengthen the model's understanding of actions, the authors design a question-answering task. Questions generated about verbs and nouns guide the model to focus on key cues in the video. According to the authors, this improves performance and encourages deeper reasoning about the scene.

In summary, this paper aims to improve the performance of vision-language models on action recognition, especially in egocentric videos, by improving multimodal fusion and introducing a question-guided learning mechanism.
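
To make the idea of a shared state-transition matrix concrete, the following is a minimal sketch of a diagonal selective state-space recurrence in plain Python. It is not the authors' implementation. The function names (`selective_scan`, `fuse`), the per-modality input projections (`B_v`, `B_t`), the step-size weights (`w_v`, `w_t`), and the choice to carry the hidden state from the vision scan into the language scan are all assumptions made for illustration; the only element taken from the summary is that one transition matrix `A` is reused for both modalities. The discretization follows the common selective-SSM recipe, `a_bar = exp(delta * A)` and `b_bar ≈ delta * B`, with an input-dependent step size `delta`.

```python
import math


def softplus(x):
    # Smooth, strictly positive mapping used to produce the step size.
    return math.log1p(math.exp(x))


def selective_scan(tokens, A, B, w_delta, h0=None):
    """Run a diagonal selective SSM over a sequence of feature vectors.

    tokens  : list of input vectors, each of length d_in
    A       : list of N negative floats (diagonal state-transition entries)
    B       : N x d_in input projection (hypothetical, modality-specific)
    w_delta : d_in weights that make the step size depend on the input
    h0      : optional initial hidden state of length N
    Returns the final hidden state as a list of N floats.
    """
    h = list(h0) if h0 is not None else [0.0] * len(A)
    for x in tokens:
        # Input-dependent ("selective") step size.
        delta = softplus(sum(w * xi for w, xi in zip(w_delta, x)))
        for n in range(len(A)):
            a_bar = math.exp(delta * A[n])          # discretized transition
            u = sum(B[n][i] * x[i] for i in range(len(x)))
            h[n] = a_bar * h[n] + delta * u         # state update
    return h


def fuse(vision_tokens, text_tokens, A, B_v, B_t, w_v, w_t):
    # Assumption: both modalities share A; the state left by the vision
    # scan is used as the starting state for the language scan.
    h = selective_scan(vision_tokens, A, B_v, w_v)
    return selective_scan(text_tokens, A, B_t, w_t, h0=h)


if __name__ == "__main__":
    A = [-1.0, -0.5]                      # shared across modalities
    B_v = [[1.0, 0.0], [0.0, 1.0]]        # vision projection (illustrative)
    B_t = [[0.5, 0.5], [0.5, -0.5]]       # language projection (illustrative)
    vision = [[1.0, 0.0], [0.0, 1.0]]
    text = [[1.0, 1.0]]
    print(fuse(vision, text, A, B_v, B_t, [0.0, 0.0], [0.0, 0.0]))
```

Because `A` has negative entries and `delta` is positive, each `a_bar` lies strictly between 0 and 1, so older information decays at a rate the current input controls. The cost of the scan grows linearly with the number of tokens, in contrast to the quadratic cost of attention mentioned above.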