Multi-Modal Parameter-Efficient Fine-tuning via Graph Neural Network

Bin Cheng, Jiaxuan Lu
2024-08-01
Abstract: With the advent of the era of foundation models, pre-training and fine-tuning have become common paradigms. Recently, parameter-efficient fine-tuning has garnered widespread attention because it offers a better balance between the number of learnable parameters and performance. However, some current parameter-efficient fine-tuning methods model only a single modality and make no use of structural knowledge in downstream tasks. To address this issue, this paper proposes a multi-modal parameter-efficient fine-tuning method based on graph networks. Each image is fed into a multi-modal large language model (MLLM) to generate a text description. The image and its corresponding text description are then processed by a frozen image encoder and a frozen text encoder to generate image features and text features, respectively. A graph is constructed based on the similarity of the multi-modal feature nodes, and knowledge and relationships relevant to these features are extracted from each node. Additionally, Elastic Weight Consolidation (EWC) regularization is incorporated into the loss function to mitigate the problem of forgetting during task learning. The proposed model improves test accuracy on the OxfordPets, Flowers102, and Food101 datasets by 4.45%, 2.92%, and 0.23%, respectively. The code is available at https://github.com/yunche0/GA-Net/tree/master.
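The graph-construction step in the abstract can be illustrated with a small sketch. The abstract says only that the graph is built from the similarity of multi-modal feature nodes. The use of cosine similarity, the `threshold` parameter, the mean-aggregation step, and all function names below are assumptions made for illustration, not details taken from the paper or its repository.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def build_similarity_graph(features, threshold=0.5):
    """Connect nodes whose similarity exceeds `threshold` (assumed rule).

    `features` is a list of vectors, e.g. image and text features placed
    in one list. Returns an undirected adjacency list {node: [neighbors]}.
    """
    n = len(features)
    adjacency = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(features[i], features[j]) > threshold:
                adjacency[i].append(j)
                adjacency[j].append(i)
    return adjacency


def aggregate_neighbors(features, adjacency):
    """One mean-aggregation step: each node averages itself with its neighbors."""
    updated = []
    for i, feat in enumerate(features):
        group = [feat] + [features[j] for j in adjacency[i]]
        updated.append([sum(vals) / len(group) for vals in zip(*group)])
    return updated


# Toy example: nodes 0 and 1 are similar, node 2 is orthogonal to node 0.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_graph = build_similarity_graph(toy_features, threshold=0.5)
toy_updated = aggregate_neighbors(toy_features, toy_graph)
```

In this toy case only nodes 0 and 1 are linked, so node 2 keeps its own feature after aggregation. A real graph neural network layer would add learnable weights and a nonlinearity on top of this kind of neighbor aggregation.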
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses two limitations of current parameter-efficient fine-tuning methods in low-data settings: they model only a single modality, and they make little use of structural knowledge in downstream tasks. Specifically:

1. **Multimodal Parameter-Efficient Fine-Tuning**: Proposes a multimodal parameter-efficient fine-tuning method based on Graph Neural Networks (GNN), which combines image and text information to better capture the complex associations between modalities.
2. **Utilization of Structural Knowledge**: Uses a graph structure to extract relevant knowledge for each multimodal feature node, so that the model learns from both text and image information while taking their adjacency relationships into account.
3. **Preventing Forgetting Issues**: Introduces Elastic Weight Consolidation (EWC) regularization into the loss function to mitigate forgetting during task learning.

Experimental results show that this method improves test accuracy by 4.45%, 2.92%, and 0.23% on the OxfordPets, Flowers102, and Food101 datasets, respectively, outperforming existing state-of-the-art methods. The model also compares well in parameter count and memory consumption, achieving a good balance between performance and the number of parameters.
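The EWC regularization mentioned above can be sketched in its commonly cited form, where the total loss is the task loss plus a quadratic penalty, `(lam / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2`, with `F_i` a diagonal Fisher information estimate and `theta_star` the reference parameters. This is a minimal sketch of that general formulation. The function names, the flat-list parameter layout, and the `lam` default are assumptions for illustration, and how the paper estimates the Fisher terms or sets the weight is not stated in this summary.

```python
def ewc_penalty(params, anchor_params, fisher, lam=1.0):
    """Quadratic EWC penalty over flat lists of parameters.

    params:        current parameter values (theta)
    anchor_params: reference values to stay close to (theta_star)
    fisher:        per-parameter importance (diagonal Fisher estimate)
    lam:           regularization strength (hypothetical default)
    """
    if not (len(params) == len(anchor_params) == len(fisher)):
        raise ValueError("params, anchor_params and fisher must have equal length")
    weighted_sq_diff = sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )
    return 0.5 * lam * weighted_sq_diff


def total_loss(task_loss, params, anchor_params, fisher, lam=1.0):
    """Task loss plus the EWC penalty."""
    return task_loss + ewc_penalty(params, anchor_params, fisher, lam)


# Toy example: only the first parameter has moved away from its anchor.
example_penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], [2.0, 5.0], lam=0.5)
example_total = total_loss(1.0, [1.0, 2.0], [0.0, 2.0], [2.0, 5.0], lam=0.5)
```

Parameters with a large Fisher value are penalized more heavily for drifting, which is how the penalty discourages overwriting knowledge that mattered before fine-tuning.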