Abstract: Video Language Models (VLMs) are crucial for generalizing across diverse tasks and using language cues to enhance learning. While transformer-based architectures have been the de facto standard in vision-language training, they face challenges like quadratic computational complexity, high GPU memory usage, and difficulty with long-term dependencies. To address these limitations, we introduce MambaVL, a novel model that leverages recent advancements in selective state space modality fusion to efficiently capture long-range dependencies and learn joint representations for vision and language data. MambaVL utilizes a shared state transition matrix across both modalities, allowing the model to capture information about actions from multiple perspectives within the scene. Furthermore, we propose a question-answering task that helps guide the model toward relevant cues. These questions provide critical information about actions, objects, and environmental context, leading to enhanced performance. As a result, MambaVL achieves state-of-the-art performance in action recognition on the Epic-Kitchens-100 dataset and outperforms baseline methods in action anticipation.

What problem does this paper attempt to address?

This paper addresses the high computational complexity, large GPU memory footprint, and difficulty with long-term dependencies that existing vision-language models (VLMs) face when processing long sequences and fusing modalities. Specifically:

1. **Processing of long-sequence dependencies**: Transformer-based architectures are inefficient during training and inference, and they struggle to capture dependencies over long time ranges in video data, because the cost of their attention mechanism grows quadratically with sequence length.
2. **Effectiveness of multimodal fusion**: Existing VLMs usually rely on descriptions or subtitles to correlate visual and linguistic information, which limits how deeply the model understands action recognition tasks. These models also lack effective mechanisms for integrating information from different modalities, a limitation that is most apparent in egocentric videos.

To solve these problems, the authors propose the MambaVL model, which has the following characteristics:

- **Selective state-space modality fusion**: By introducing a shared state-transition matrix, MambaVL can efficiently transfer information between the visual and language modalities, which helps it capture long-range dependencies and learn joint representations.
- **Question-answering task-guided learning**: To enhance the model's understanding of actions, the authors design a question-answering task. Questions generated about verbs and nouns guide the model to focus on key cues in the video. This both improves the model's performance and promotes deeper reasoning and understanding.
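
The idea of sharing a state-transition matrix across modalities can be illustrated with a toy one-channel selective state space recurrence. This is only a sketch under stated assumptions, not the authors' implementation: the function names, the parameter values, the diagonal form of the shared matrix, and the averaging step used to combine the two final states are all hypothetical choices made for illustration. The loop takes one constant-cost step per input element, so its cost grows linearly with sequence length rather than quadratically.

```python
import math


def softplus(z):
    """Smooth positive function used to keep the step size above zero."""
    return math.log1p(math.exp(z))


def selective_scan(xs, A, B, C, w_delta):
    """Run a toy one-channel selective SSM over a list of scalar inputs.

    A        shared diagonal state-transition entries (negative floats)
    B, C     modality-specific input and output projections
    w_delta  scalar weight that makes the step size depend on the input
    """
    h = [0.0] * len(A)
    ys = []
    for x in xs:
        # Input-dependent step size: this is the "selective" part.
        delta = softplus(w_delta * x)
        # Discretized update: decay the old state, then add the new input.
        h = [math.exp(delta * a) * h_i + delta * b * x
             for a, b, h_i in zip(A, B, h)]
        ys.append(sum(c * h_i for c, h_i in zip(C, h)))
    return ys, h


# Hypothetical toy parameters; only A is shared between the modalities.
SHARED_A = [-0.5, -1.0, -2.0]
VISION = {"B": [1.0, 0.5, 0.25], "C": [0.3, 0.3, 0.3], "w_delta": 0.8}
LANGUAGE = {"B": [0.2, 0.4, 0.6], "C": [0.5, -0.2, 0.1], "w_delta": 1.2}


def fuse(vision_feats, text_feats):
    """Scan both modalities with the shared A and average the final states.

    Averaging is a placeholder assumption, not the paper's fusion rule.
    """
    _, h_v = selective_scan(vision_feats, SHARED_A, **VISION)
    _, h_t = selective_scan(text_feats, SHARED_A, **LANGUAGE)
    return [(a + b) / 2 for a, b in zip(h_v, h_t)]
```

Because both scans evolve their hidden state under the same `SHARED_A`, the two final states live in a common space and can be combined directly, which is the intuition behind sharing the transition across modalities.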
In summary, this paper aims to improve the performance of vision-language models in action recognition tasks, especially when dealing with egocentric videos, by improving multimodal fusion methods and introducing new learning mechanisms.
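
As a rough illustration of the question-answering guidance, the sketch below turns a clip's verb and noun labels into question-answer pairs. The templates, the `QAPair` type, and the `make_qa_pairs` helper are hypothetical; the summary does not state the exact questions the authors use, so this shows only the general shape of such a task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str


# Hypothetical templates, one targeting the verb and one targeting the noun.
VERB_TEMPLATE = "What action is the person performing?"
NOUN_TEMPLATE = "Which object is the person interacting with?"


def make_qa_pairs(verb, noun):
    """Build one verb question and one noun question for a labeled clip."""
    return [QAPair(VERB_TEMPLATE, verb), QAPair(NOUN_TEMPLATE, noun)]


# Example: a clip annotated with the verb "cut" and the noun "onion".
pairs = make_qa_pairs("cut", "onion")
```

Each pair can then serve as an auxiliary training target alongside the action label, so the model is pushed to attend to the cues that identify the verb and the noun separately.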

Mamba Fusion: Learning Actions Through Questioning
