Selective State Space Memory for Large Vision-Language Models

Chee Ng, Yuen Fung
2024-12-13
Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across a wide range of multimodal tasks. However, fine-tuning these models for domain-specific applications remains a computationally intensive challenge. This paper introduces State Space Memory Integration (SSMI), a novel approach for efficient fine-tuning of LVLMs. By integrating lightweight Mamba-based state space modules into the LVLM architecture, SSMI captures long-range dependencies and injects task-specific visual and sequential patterns effectively. Unlike traditional fine-tuning methods, SSMI requires only a fraction of the model's parameters to be updated, making it computationally efficient and scalable. Experiments on benchmark datasets, including COCO Captioning, VQA, and Flickr30k, demonstrate that SSMI achieves state-of-the-art performance while maintaining robustness and generalization capabilities. Comprehensive analysis further validates the advantages of SSMI in terms of efficiency, adaptability, and interpretability, positioning it as a compelling solution for fine-tuning large-scale vision-language models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the following problem: **how can large vision-language models (LVLMs) be fine-tuned effectively for domain-specific tasks while remaining efficient and scalable?**

Although existing LVLMs perform well across a variety of multimodal tasks, fine-tuning them for domain-specific applications demands substantial compute and memory. Traditional fine-tuning methods usually update a large number of model parameters, which is computationally expensive and can also degrade the model's original representations. Approaches that add external adapters have their own problems, such as poor scalability and difficulty capturing long-range dependencies.

To address these problems, the paper proposes a new fine-tuning method, **State Space Memory Integration (SSMI)**. SSMI integrates Mamba-based state space modules into the LVLM architecture to capture long-range dependencies and inject task-specific visual and sequential patterns. Compared with traditional methods, SSMI updates only a small fraction of the model's parameters, which significantly reduces computational cost while improving the model's adaptability and interpretability.

### Main contributions:

1. **Proposed the SSMI method**: Mamba-based state space modules are introduced to fine-tune LVLMs effectively.
2. **Demonstrated superior performance**: Experimental results on multiple benchmark datasets show that SSMI outperforms existing methods while offering higher parameter efficiency and computational scalability.
3. **Provided comprehensive experimental verification**: Detailed ablation studies and comparative analyses verify the effectiveness and robustness of SSMI across different vision-language tasks.
### Formula summary:

- State-space dynamic equation:
  \[ s_{t+1} = A s_t + B h_t \]
  \[ y_t = C s_t + D h_t \]
  where \(s_t\) is the state at time \(t\), \(h_t\) is the input, \(y_t\) is the output, and \(A, B, C, D\) are learnable parameters.
- State-space output after frequency-domain discretization:
  \[ Y = C(I - zA)^{-1} B H + D H \]
  where \(H\) is the input sequence and \(z\) is the discretization operator.
- Pre-training loss function:
  \[ L_{\text{pretrain}} = \frac{1}{T} \sum_{t=1}^{T} \| y_t - t_t \|_2^2 \]
  where \(y_t\) is the output of the state-space model and \(t_t\) is the target text embedding.
- Task-specific fine-tuning loss function:
  \[ L_{\text{task}} = \mathbb{E}_{(X, Y)}\left[ L(\hat{Y}, Y) \right] \]
  where \(\hat{Y}\) is the model's prediction and \(L\) is the task-specific loss function.
- Total loss function:
  \[ L_{\text{total}} = \lambda L_{\text{pretrain}} + (1 - \lambda) L_{\text{task}} \]
  where \(\lambda\) balances the contributions of pre-training and fine-tuning.

Through these improvements, SSMI can significantly reduce the number of parameter updates while maintaining high performance, and is suitable for resource-constrained environments.
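
The state-space recurrence and the loss terms summarized above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes dense matrices \(A, B, C, D\), a zero initial state, and random placeholder data. The function names (`ssm_scan`, `pretrain_loss`, `total_loss`), the dimensions, the value of \(\lambda\), and the task-loss value are all hypothetical.

```python
import numpy as np

def ssm_scan(A, B, C, D, H, s0=None):
    """Apply s_{t+1} = A s_t + B h_t and y_t = C s_t + D h_t to H of shape (T, d_in).

    Assumes a zero initial state unless s0 is given (an assumption, not from the paper).
    """
    s = np.zeros(A.shape[0]) if s0 is None else s0
    outputs = []
    for h in H:
        outputs.append(C @ s + D @ h)  # y_t depends on the current state s_t
        s = A @ s + B @ h              # advance to s_{t+1}
    return np.stack(outputs)           # shape (T, d_out)

def pretrain_loss(Y, targets):
    """(1/T) * sum_t ||y_t - t_t||_2^2 for Y and targets of shape (T, d_out)."""
    return np.mean(np.sum((Y - targets) ** 2, axis=1))

def total_loss(l_pretrain, l_task, lam):
    """lambda * L_pretrain + (1 - lambda) * L_task."""
    return lam * l_pretrain + (1.0 - lam) * l_task

# Hypothetical sizes: state dim 4, input/output dim 3, sequence length 6.
rng = np.random.default_rng(0)
n, d, T = 4, 3, 6
A = 0.9 * np.eye(n)                  # a simple stable state matrix (illustrative choice)
B = rng.normal(size=(n, d))
C = rng.normal(size=(d, n))
D = rng.normal(size=(d, d))
H = rng.normal(size=(T, d))          # stand-in for the input sequence h_1..h_T
targets = rng.normal(size=(T, d))    # stand-in for target text embeddings t_t

Y = ssm_scan(A, B, C, D, H)
l_pre = pretrain_loss(Y, targets)
l_task = 1.0                         # placeholder for a task-specific loss value
l_total = total_loss(l_pre, l_task, lam=0.3)
```

The explicit loop mirrors the recurrence term by term, which makes it easy to check against the equations; it is not meant to reflect how Mamba-style modules are computed efficiently in practice.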