HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

Wenqiao Zhang, Tianwei Lin, Jiang Liu, Fangxun Shu, Haoyuan Li, Lei Zhang, He Wanggui, Hao Zhou, Zheqi Lv, Hao Jiang, Juncheng Li, Siliang Tang, Yueting Zhuang
2024-03-20
Abstract: Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens using a static vision-language mapper, thereby enabling static LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the static tuning strategy (static tuning refers to a trained model with static parameters), which shares the same parameters, may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which adaptively tunes the projector and LLM parameters in conjunction with a dynamic visual expert and a dynamic language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training.
Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses the performance limitations of multimodal large language models (MLLMs) across downstream tasks. Current MLLMs, such as LLaVA, typically employ a static vision-language mapper that converts visual features into text-like tokens, enabling a static large language model (LLM) to understand visual information through visual instruction tuning. Because this static tuning strategy shares the same parameters, it may limit the model's performance across different downstream multimodal tasks.

To overcome this limitation, the authors propose HyperLLaVA, a dynamic tuning approach that adaptively tunes the projector and LLM parameters using a dynamic visual expert and a dynamic language expert. These expert modules are derived from a hypernetwork that generates adaptive parameter shifts from visual and language guidance, enabling dynamic projector and LLM modeling in a two-stage training process. Experimental results show that HyperLLaVA significantly outperforms LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench.

### Main Contributions

1. **Dynamic Tuning Strategy**: The paper investigates dynamic tuning for MLLMs and introduces HyperLLaVA, which optimizes the projector and the LLM through vision-guided and language-guided dynamic tuning.
2. **Efficient Multi-task Fine-tuning Method**: The proposed visual and language expert modules offer a parameter-efficient approach to multi-task fine-tuning.
3. **Extensive Experimental Validation**: Comprehensive experiments on multiple MLLM benchmarks demonstrate the effectiveness and generality of the proposed method.
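
The idea of a hypernetwork producing input-dependent parameter shifts for an otherwise static projector can be illustrated with a small sketch. This is not the authors' implementation: the class name `DynamicProjector`, the low-rank form of the shift, the mean-pooled guidance vector, and all dimensions are assumptions chosen for illustration, and plain Python lists stand in for tensors.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class DynamicProjector:
    """Static projection plus a guidance-dependent low-rank shift (hypothetical)."""

    def __init__(self, d_in, d_out, d_guide, rank=2, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        # Static mapper: identical parameters for every input.
        self.W_static = rand_matrix(d_out, d_in, rng)
        # Hypernetwork (here a single linear layer per factor): maps the
        # guidance vector to the flattened down- and up-projection weights.
        self.H_down = rand_matrix(rank * d_in, d_guide, rng)
        self.H_up = rand_matrix(d_out * rank, d_guide, rng)

    def shift(self, guidance):
        """Generate the low-rank factors of the parameter shift from guidance."""
        down_flat = matvec(self.H_down, guidance)
        up_flat = matvec(self.H_up, guidance)
        down = [down_flat[r * self.d_in:(r + 1) * self.d_in] for r in range(self.rank)]
        up = [up_flat[o * self.rank:(o + 1) * self.rank] for o in range(self.d_out)]
        return down, up

    def forward(self, x, guidance):
        """Return W_static @ x + up(guidance) @ (down(guidance) @ x)."""
        static_out = matvec(self.W_static, x)
        down, up = self.shift(guidance)
        delta = matvec(up, matvec(down, x))
        return [s + d for s, d in zip(static_out, delta)]


# Usage: two toy "visual tokens"; guidance is their mean (an assumption).
proj = DynamicProjector(d_in=4, d_out=3, d_guide=4)
tokens = [[0.5, -1.0, 0.25, 2.0], [1.0, 0.0, -0.5, 0.5]]
guidance = [sum(col) / len(tokens) for col in zip(*tokens)]
outputs = [proj.forward(t, guidance) for t in tokens]
print(outputs)
```

In this sketch the static weights are shared by every input, while the shift changes with the guidance vector, so two images with different pooled features are projected by effectively different mappers. A language expert could follow the same pattern with guidance taken from intermediate LLM features, though the summary above does not specify those details.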