An Empirical Study on Parameter-Efficient Fine-Tuning for MultiModal Large Language Models

Xiongtao Zhou, Jie He, Yuhua Ke, Guangyao Zhu, Víctor Gutiérrez-Basulto, Jeff Z. Pan
2024-06-08
Abstract: Multimodal large language models (MLLMs) fine-tuned with multimodal instruction datasets have demonstrated remarkable capabilities in multimodal tasks. However, fine-tuning all parameters of MLLMs has become challenging as they usually contain billions of parameters. To address this issue, we study parameter-efficient fine-tuning (PEFT) methods for MLLMs. We aim to identify effective methods for enhancing the performance of MLLMs in scenarios where only a limited number of parameters are trained. This paper conducts empirical studies using four popular PEFT methods to fine-tune the LLM component of open-source MLLMs. We present a comprehensive analysis that encompasses various aspects, including the impact of PEFT methods on various models, parameters and location of the PEFT module, size of fine-tuning data, model stability based on PEFT methods, MLLM's generalization, and hallucination. We evaluated four PEFT methods on seven datasets from two different categories: unseen and seen datasets. Across all experiments, we show that the adapter is the best-performing PEFT method. At the same time, fine-tuning the connector layers leads to improved performance in most MLLMs. Code and data are available at https://github.com/alenai97/PEFT-MLLM.git.
Computation and Language
What problem does this paper attempt to address?
This paper studies how to fine-tune multimodal large language models (MLLMs) effectively with parameter-efficient fine-tuning (PEFT). Because MLLMs usually contain billions of parameters, fine-tuning all of them is challenging. The authors apply four popular PEFT methods (adapters, LoRA, IA3, and prefix-tuning) to the LLM component of open-source MLLMs and analyze the position of the PEFT module, data scale, model stability, generalization ability, and hallucination. The main findings are:
1. Fine-tuning the connector layers generally improves the performance of MLLMs.
2. On unseen datasets, more trainable parameters lead to better performance, while on seen datasets, fewer trainable parameters are enough to maintain performance.
3. Large-scale datasets usually lead to better performance, but medium-scale datasets are a feasible choice when resources are limited.
4. Adapters show the best overall performance in generalization, stability, and hallucination, followed by LoRA.
5. Comparing each PEFT method with fine-tuned and with frozen connector layers, fine-tuning the connector layers is generally better on unseen datasets, while freezing them works better on seen datasets.
The paper also reports experiments on the position of PEFT modules, data scale, overfitting, and generalization to determine which PEFT method is most effective under different circumstances. For adapters and LoRA, inserting PEFT modules in both the attention layer and the MLP layer yields the best results, with adapters performing best in the MLP layer. As available resources increase, the performance of all PEFT methods improves, especially on unseen datasets.
In the stability analysis, adapters and LoRA are more robust in certain scenarios, while the other methods perform inconsistently across datasets. Overall, the paper's empirical results offer practical guidance on PEFT for MLLMs and help optimize model performance and generalization under limited resources.
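The trainable-parameter budgets discussed above can be made concrete with a little arithmetic. The following is a minimal sketch, not the paper's code: it counts the parameters that LoRA and a bottleneck adapter would add per transformer layer for each placement (attention, MLP, or both), using the standard definitions of the two methods. The function names, the assumed layer layout (four square attention projections and two MLP matrices), the example sizes, and the `connector_params` argument are hypothetical and chosen only for illustration.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one linear layer of shape d_in x d_out."""
    # Two low-rank factors: A is (rank x d_in), B is (d_out x rank).
    # The original weight matrix stays frozen and is not counted.
    return rank * (d_in + d_out)


def adapter_params(d_model, bottleneck):
    """Trainable parameters of one bottleneck adapter, including biases."""
    down = d_model * bottleneck + bottleneck  # down-projection weight + bias
    up = bottleneck * d_model + d_model       # up-projection weight + bias
    return down + up


def per_layer_params(method, placement, d_model, d_ff, size):
    """Trainable parameters added to one transformer layer.

    method:    "lora" or "adapter"
    placement: "attention", "mlp", or "both"
    size:      LoRA rank or adapter bottleneck width
    """
    if placement not in ("attention", "mlp", "both"):
        raise ValueError(f"unknown placement: {placement}")
    if method == "lora":
        # Assumed layout: q, k, v, o projections (each d_model x d_model)
        # and two MLP matrices (d_model x d_ff and d_ff x d_model).
        attn = 4 * lora_params(d_model, d_model, size)
        mlp = lora_params(d_model, d_ff, size) + lora_params(d_ff, d_model, size)
    elif method == "adapter":
        # Assumed layout: one adapter after each sublayer it is attached to.
        attn = adapter_params(d_model, size)
        mlp = adapter_params(d_model, size)
    else:
        raise ValueError(f"unknown method: {method}")
    return {"attention": attn, "mlp": mlp, "both": attn + mlp}[placement]


def total_trainable(method, placement, n_layers, d_model, d_ff, size,
                    connector_params=0):
    """Total trainable parameters; pass connector_params > 0 when the
    connector layers are fine-tuned instead of frozen."""
    per_layer = per_layer_params(method, placement, d_model, d_ff, size)
    return n_layers * per_layer + connector_params


if __name__ == "__main__":
    # Illustrative sizes only; they are not taken from the paper.
    for method, size in (("lora", 8), ("adapter", 64)):
        for placement in ("attention", "mlp", "both"):
            n = total_trainable(method, placement, n_layers=32,
                                d_model=4096, d_ff=11008, size=size)
            print(f"{method:8s} {placement:10s} {n:,}")
```

Running the script prints one count per method and placement, which makes it easy to see how the choice of module position and of rank or bottleneck width changes the number of trained parameters relative to a model with billions of weights.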