Abstract: Multimodal large language models (MLLMs) fine-tuned with multimodal instruction datasets have demonstrated remarkable capabilities in multimodal tasks. However, fine-tuning all parameters of MLLMs has become challenging as they usually contain billions of parameters. To address this issue, we study parameter-efficient fine-tuning (PEFT) methods for MLLMs. We aim to identify effective methods for enhancing the performance of MLLMs in scenarios where only a limited number of parameters are trained. This paper conducts empirical studies using four popular PEFT methods to fine-tune the LLM component of open-source MLLMs. We present a comprehensive analysis that encompasses various aspects, including the impact of PEFT methods on various models, parameters and location of the PEFT module, size of fine-tuning data, model stability based on PEFT methods, MLLM's generalization, and hallucination. We evaluated four PEFT methods on seven datasets from two different categories: unseen and seen datasets. Across all experiments, we show that the adapter is the best-performing PEFT method. At the same time, fine-tuning the connector layers leads to improved performance in most MLLMs. Code and data are available at https://github.com/alenai97/PEFT-MLLM.git.

What problem does this paper attempt to address?

This paper studies how to fine-tune multimodal large language models (MLLMs) effectively with parameter-efficient fine-tuning (PEFT). As the number of parameters in MLLMs keeps growing, full-parameter fine-tuning becomes impractical. The authors apply four popular PEFT methods (adapters, LoRA, IA3, and Prefix-Tuning) to the LLM component of open-source MLLMs and analyze the effect of module location, data scale, model stability, generalization, and hallucination. The main findings are:

1. Fine-tuning the connector layers generally improves the performance of MLLMs.
2. On unseen datasets, more trainable parameters lead to better performance, while on seen datasets, fewer trainable parameters are enough to maintain performance.
3. Large-scale fine-tuning datasets usually lead to better performance, but medium-scale datasets are a feasible choice when resources are limited.
4. Adapters show the best overall performance in terms of generalization, stability, and hallucination, followed by LoRA.
5. Comparing each PEFT method with the connector layers fine-tuned and with them frozen, fine-tuning the connector is generally better on unseen datasets, while freezing it works better on seen datasets.

The paper also runs experiments on PEFT module location, data scale, overfitting, and generalization to determine which PEFT method is most effective under which circumstances. For adapters and LoRA, inserting PEFT modules in both the attention and MLP layers yields the best results; adapters are also reported to perform best in the MLP layer. As available resources increase, the performance of all PEFT methods improves, especially on unseen datasets.

In the stability analysis, adapters and LoRA are more robust in certain scenarios, while the other methods perform inconsistently across datasets. Overall, the paper offers empirical guidance on PEFT for MLLMs that helps improve model performance and generalization under limited resources.
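Since several of the findings above turn on how many parameters are trained and where the PEFT modules sit, a rough parameter count may help make the trade-off concrete. The sketch below is not taken from the paper: the block dimensions (hidden size 4096, MLP size 11008, three MLP projections, 32 layers), the LoRA rank, the adapter bottleneck size, and the prefix length are all illustrative assumptions, and the helper names are hypothetical. It only applies the standard per-module counts for each method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockShape:
    """Shape of one transformer block. All defaults are illustrative assumptions."""
    d_model: int = 4096   # hidden size (assumed, LLaMA-7B-like)
    d_ff: int = 11008     # MLP inner size (assumed)
    mlp_mats: int = 3     # gated MLP with gate/up/down projections (assumed)


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) beside one frozen weight."""
    return rank * (d_in + d_out)


def lora_per_block(shape: BlockShape, rank: int, location: str) -> int:
    """LoRA on the four attention projections, the MLP projections, or both."""
    attn = 4 * lora_params(shape.d_model, shape.d_model, rank)
    mlp = shape.mlp_mats * lora_params(shape.d_model, shape.d_ff, rank)
    return {"attention": attn, "mlp": mlp, "both": attn + mlp}[location]


def adapter_params(d_model: int, bottleneck: int) -> int:
    """Bottleneck adapter: down-projection, up-projection, and both biases."""
    return 2 * d_model * bottleneck + bottleneck + d_model


def adapter_per_block(shape: BlockShape, bottleneck: int, location: str) -> int:
    """One adapter after the attention layer, after the MLP layer, or after both."""
    one = adapter_params(shape.d_model, bottleneck)
    return {"attention": one, "mlp": one, "both": 2 * one}[location]


def ia3_per_block(shape: BlockShape) -> int:
    """IA3 learns scaling vectors for keys, values, and the MLP inner activation."""
    return 2 * shape.d_model + shape.d_ff


def prefix_per_block(shape: BlockShape, prefix_len: int) -> int:
    """Prefix-Tuning learns prefix_len key vectors and value vectors per block.
    The optional reparameterization MLP used during training is ignored here."""
    return 2 * prefix_len * shape.d_model


if __name__ == "__main__":
    shape = BlockShape()
    n_layers = 32  # assumed number of transformer blocks
    rows = {
        "LoRA r=8, attention": lora_per_block(shape, 8, "attention"),
        "LoRA r=8, mlp": lora_per_block(shape, 8, "mlp"),
        "LoRA r=8, both": lora_per_block(shape, 8, "both"),
        "Adapter b=64, mlp": adapter_per_block(shape, 64, "mlp"),
        "Adapter b=64, both": adapter_per_block(shape, 64, "both"),
        "IA3": ia3_per_block(shape),
        "Prefix-Tuning, 20 tokens": prefix_per_block(shape, 20),
    }
    for name, per_block in rows.items():
        print(f"{name:<28}{per_block * n_layers:>14,}")
```

Under these assumed settings, every configuration trains only a small fraction of a multi-billion-parameter LLM, and placing modules in both the attention and MLP layers roughly doubles the budget of a single location. The actual hyperparameters and parameter counts used in the paper may differ.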

An Empirical Study on Parameter-Efficient Fine-Tuning for MultiModal Large Language Models
