Multi-Modal Parameter-Efficient Fine-tuning via Graph Neural Network

Bin Cheng, Jiaxuan Lu

2024-08-01

Abstract: With the advent of the era of foundation models, pre-training and fine-tuning have become common paradigms. Recently, parameter-efficient fine-tuning has garnered widespread attention due to its better balance between the number of learnable parameters and performance. However, some current parameter-efficient fine-tuning methods only model a single modality and lack the utilization of structural knowledge in downstream tasks. To address this issue, this paper proposes a multi-modal parameter-efficient fine-tuning method based on graph networks. Each image is fed into a multi-modal large language model (MLLM) to generate a text description. The image and its corresponding text description are then processed by a frozen image encoder and text encoder to generate image features and text features, respectively. A graph is constructed based on the similarity of the multi-modal feature nodes, and knowledge and relationships relevant to these features are extracted from each node. Additionally, Elastic Weight Consolidation (EWC) regularization is incorporated into the loss function to mitigate the problem of forgetting during task learning. The proposed model achieves test accuracies on the OxfordPets, Flowers102, and Food101 datasets that improve by 4.45%, 2.92%, and 0.23%, respectively. The code is available at https://github.com/yunche0/GA-Net/tree/master.
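The graph-construction step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cosine similarity measure, the k-nearest-neighbour edge rule, the mean aggregation, and all function names are assumptions, and the toy vectors merely stand in for the outputs of the frozen image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_knn_graph(features, k):
    """Link each node to its k most similar other nodes (edges are symmetrised)."""
    n = len(features)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        sims = [(cosine(features[i], features[j]), j) for j in range(n) if j != i]
        sims.sort(reverse=True)  # most similar first
        for _, j in sims[:k]:
            adj[i][j] = adj[j][i] = 1
    return adj

def aggregate(features, adj):
    """One mean-aggregation step: each node averages itself and its neighbours."""
    out = []
    for i, row in enumerate(adj):
        idx = [i] + [j for j, edge in enumerate(row) if edge]
        dim = len(features[i])
        out.append([sum(features[j][d] for j in idx) / len(idx) for d in range(dim)])
    return out

# Hypothetical toy features: two image nodes followed by two text nodes.
image_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[0.9, 0.1], [0.1, 0.9]]
nodes = image_feats + text_feats

adj = build_knn_graph(nodes, k=1)
updated = aggregate(nodes, adj)
```

In this toy example each image node ends up connected to the text node that describes it, so one aggregation step mixes information across the two modalities. A real system would replace `aggregate` with learnable GNN layers.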

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the issue of current parameter-efficient fine-tuning methods modeling only a single modality in low-data environments and lacking the utilization of structural knowledge in downstream tasks. Specifically:

1. **Multimodal Parameter-Efficient Fine-Tuning**: Proposes a multimodal parameter-efficient fine-tuning method based on Graph Neural Networks (GNN), which better captures the complex associations between different modalities by combining image and text information.
2. **Utilization of Structural Knowledge**: Utilizes graph structures to extract relevant knowledge for each multimodal feature node, thereby fully learning text and image information and considering their adjacency relationships.
3. **Preventing Forgetting Issues**: Introduces Elastic Weight Consolidation (EWC) regularization into the loss function to mitigate the forgetting problem in task learning.

Experimental results show that this method improves test accuracy by 4.45%, 2.92%, and 0.23% on the Oxford Pets, Flowers102, and Food101 datasets, respectively, outperforming existing state-of-the-art methods. Additionally, the model also performs well in terms of the number of parameters and memory consumption, achieving a good balance between performance and parameter count.
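The EWC regularization mentioned above can be illustrated with the standard form of the penalty, in which the total loss is the task loss plus (λ/2) · Σ F_i (θ_i − θ*_i)², where θ* are the reference parameters and F_i their diagonal Fisher information. The sketch below assumes this standard form. The function names, the value of λ, and the toy numbers are hypothetical, and the paper's exact weighting may differ.

```python
def ewc_penalty(params, anchor_params, fisher, lam):
    """Standard EWC term: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )

def total_loss(task_loss, params, anchor_params, fisher, lam):
    """Task loss plus the EWC penalty that discourages drifting from the anchor."""
    return task_loss + ewc_penalty(params, anchor_params, fisher, lam)

# Hypothetical toy values: the first parameter has moved, the second has not.
params = [1.0, 2.0]
anchor_params = [0.0, 2.0]   # parameters after the earlier task
fisher = [2.0, 5.0]          # diagonal Fisher information (importance weights)

penalty = ewc_penalty(params, anchor_params, fisher, lam=0.5)
loss = total_loss(1.0, params, anchor_params, fisher, lam=0.5)
```

Parameters with large Fisher values are penalized more heavily for moving, which is how the term limits forgetting while still letting less important parameters adapt.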

Parameter-Efficient Tuning Large Language Models for Graph Representation Learning

An Empirical Study on Parameter-Efficient Fine-Tuning for MultiModal Large Language Models

HyperPELT: Unified Parameter-Efficient Language Model Tuning for Both Language and Vision-and-Language Tasks

AdapterGNN: Parameter-Efficient Fine-Tuning Improves Generalization in GNNs

Towards a Unified View of Parameter-Efficient Transfer Learning

Discovering Long-Term Effects on Parameter Efficient Fine-tuning

Adaptive Principal Components Allocation with the $\ell_{2,g}$-regularized Gaussian Graphical Model for Efficient Fine-Tuning Large Models

Multi-modal Graph Contrastive Encoding for Neural Machine Translation

Multi-scale Network Via Progressive Multi-Granularity Attention for Fine-Grained Visual Classification

Parameter-efficient Tuning of Large-scale Multimodal Foundation Model

Multifaceted Analysis of Fine-Tuning in Deep Model for Visual Recognition

$π$-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation

Multi-directional guidance network for fine-grained visual classification

Gradient-based Parameter Selection for Efficient Fine-Tuning

Graph Metanetworks for Processing Diverse Neural Architectures

Bridging Pre-Trained Models to Continual Learning: A Hypernetwork Based Framework with Parameter-Efficient Fine-Tuning Techniques

Meta-GF: Training Dynamic-Depth Neural Networks Harmoniously

Efficient Multi-domain Text Recognition Deep Neural Network Parameterization with Residual Adapters

Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models

Refining Joint Text and Source Code Embeddings for Retrieval Task with Parameter-Efficient Fine-Tuning