Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across a wide range of multimodal tasks. However, fine-tuning these models for domain-specific applications remains a computationally intensive challenge. This paper introduces State Space Memory Integration (SSMI), a novel approach for efficient fine-tuning of LVLMs. By integrating lightweight Mamba-based state space modules into the LVLM architecture, SSMI captures long-range dependencies and injects task-specific visual and sequential patterns effectively. Unlike traditional fine-tuning methods, SSMI requires only a fraction of the model's parameters to be updated, making it computationally efficient and scalable. Experiments on benchmark datasets, including COCO Captioning, VQA, and Flickr30k, demonstrate that SSMI achieves state-of-the-art performance while maintaining robustness and generalization capabilities. Comprehensive analysis further validates the advantages of SSMI in terms of efficiency, adaptability, and interpretability, positioning it as a compelling solution for fine-tuning large-scale vision-language models.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **How can large-scale vision-language models (LVLMs) be fine-tuned effectively for domain-specific tasks while remaining efficient and scalable?**

Although existing LVLMs perform well on a variety of multimodal tasks, fine-tuning them for domain-specific applications demands intensive computation and a large memory footprint. Traditional fine-tuning methods usually update a large number of model parameters, which is computationally expensive and may also damage the model's original representation ability. Approaches that add external adapters have their own problems, such as poor scalability and difficulty capturing long-range dependencies.

To address these problems, the paper proposes a new fine-tuning method, **State Space Memory Integration (SSMI)**. By integrating Mamba-based state space modules into the LVLM architecture, SSMI captures long-range dependencies and injects task-specific visual and sequential patterns. Compared with traditional methods, SSMI updates only a small fraction of the model's parameters, which significantly reduces computational cost while improving the adaptability and interpretability of the model.

### Main contributions:

1. **Proposed the SSMI method**: Introducing Mamba-based state space modules enables effective fine-tuning of LVLMs.
2. **Demonstrated superior performance**: Experimental results on multiple benchmark datasets show that SSMI outperforms existing methods while offering higher parameter efficiency and computational scalability.
3. **Provided comprehensive experimental verification**: Detailed ablation studies and comparative analyses verify the effectiveness and robustness of SSMI across different vision-language tasks.

### Formula summary:

- State-space dynamic equations:
  \[ s_{t+1} = A s_t + B h_t \]
  \[ y_t = C s_t + D h_t \]
  where \(s_t\) is the state at time \(t\), \(h_t\) is the input, \(y_t\) is the output, and \(A, B, C, D\) are learnable parameters.
- State-space output after frequency-domain discretization:
  \[ Y = C(I - zA)^{-1}BH + DH \]
  where \(H\) is the input sequence and \(z\) is the discretization operator.
- Pre-training loss function:
  \[ L_{\text{pretrain}} = \frac{1}{T}\sum_{t=1}^{T}\|y_t - t_t\|_2^2 \]
  where \(y_t\) is the output of the state-space model and \(t_t\) is the target text embedding.
- Task-specific fine-tuning loss function:
  \[ L_{\text{task}} = \mathbb{E}_{(X, Y)}\left[L(\hat{Y}, Y)\right] \]
  where \(\hat{Y}\) is the model's prediction and \(L\) is the task-specific loss function.
- Total loss function:
  \[ L_{\text{total}} = \lambda L_{\text{pretrain}} + (1 - \lambda)L_{\text{task}} \]
  where \(\lambda\) balances the contributions of pre-training and fine-tuning.

Through these improvements, SSMI can significantly reduce the number of parameter updates while maintaining high performance, and is suitable for resource-constrained environments.
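
To make the state-space recurrence and the combined loss concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names (`matvec`, `vadd`, `ssm_scan`, `pretrain_loss`, `total_loss`), the zero initial state, the tiny fixed matrices, the targets, and the task-loss value are all hypothetical. In the paper \(A, B, C, D\) are learned and the module is Mamba-based; this sketch only runs the plain linear recurrence.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def vadd(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def ssm_scan(A, B, C, D, H, s0=None):
    """Apply s_{t+1} = A s_t + B h_t and y_t = C s_t + D h_t over the sequence H.

    Assumption: the state starts at zero unless s0 is given.
    """
    s = list(s0) if s0 is not None else [0.0] * len(A)
    Y = []
    for h in H:
        Y.append(vadd(matvec(C, s), matvec(D, h)))  # y_t reads the current state s_t
        s = vadd(matvec(A, s), matvec(B, h))        # then advance to s_{t+1}
    return Y


def pretrain_loss(Y, targets):
    """Mean over time steps of the squared L2 distance ||y_t - t_t||^2."""
    per_step = [
        sum((y - t) ** 2 for y, t in zip(y_t, t_t))
        for y_t, t_t in zip(Y, targets)
    ]
    return sum(per_step) / len(per_step)


def total_loss(l_pretrain, l_task, lam):
    """Convex combination lam * L_pretrain + (1 - lam) * L_task."""
    return lam * l_pretrain + (1.0 - lam) * l_task


# Hypothetical toy setup: 2-dimensional state, scalar input and output.
A = [[0.5, 0.0], [0.0, 0.25]]
B = [[1.0], [1.0]]
C = [[1.0, 1.0]]
D = [[0.5]]
H = [[1.0], [0.0], [0.0]]            # a unit impulse followed by zeros

Y = ssm_scan(A, B, C, D, H)
print(Y)                              # [[0.5], [2.0], [0.75]]

targets = [[0.5], [2.0], [-0.75]]     # made-up stand-ins for target text embeddings
l_pre = pretrain_loss(Y, targets)
print(l_pre)                          # 0.75
print(total_loss(l_pre, 0.25, 0.5))   # 0.5, using a made-up task loss of 0.25
```

With an impulse input and a zero initial state, the outputs after the first step equal \(C A^{t-1} B\), so the eigenvalues of \(A\) determine how long an early input keeps influencing later outputs. This is the mechanism behind the long-range dependency claim.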

Selective State Space Memory for Large Vision-Language Models

VSSD: Vision Mamba with Non-Causal State Space Duality

A-VL: Adaptive Attention for Large Vision-Language Models

EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba

VL-Mamba: Exploring State Space Models for Multimodal Learning

Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model

Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

InfMLLM: A Unified Framework for Visual-Language Tasks

Distillation-free Scaling of Large SSMs for Images and Videos

ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

Enhancing Model Performance: Another Approach to Vision-Language Instruction Tuning

Sequential Modeling Enables Scalable Learning for Large Vision Models

Are We on the Right Way for Evaluating Large Vision-Language Models?

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

LocalMamba: Visual State Space Model with Windowed Selective Scan