HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

Wenqiao Zhang, Tianwei Lin, Jiang Liu, Fangxun Shu, Haoyuan Li, Lei Zhang, He Wanggui, Hao Zhou, Zheqi Lv, Hao Jiang, Juncheng Li, Siliang Tang, Yueting Zhuang

2024-03-20

Abstract: Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens using a static vision-language mapper, thereby enabling static LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the static tuning strategy (static tuning refers to a trained model with static parameters), which shares the same parameters, may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training.

Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper aims to address the performance limitations of multimodal large language models (MLLMs) across downstream tasks. Current MLLMs, such as LLaVA, typically employ a static vision-language mapper that converts visual features into text-like tokens, enabling a static large language model (LLM) to understand visual information through visual instruction tuning. However, this static tuning strategy shares the same parameters across tasks, which may limit the model's performance on different downstream multimodal tasks.

To overcome this limitation, the authors propose HyperLLaVA, a dynamic tuning approach that adaptively tunes the projector and LLM parameters using a dynamic visual expert and a dynamic language expert, respectively. These expert modules generate adaptive parameter offsets through a hypernetwork, enabling dynamic projector and LLM modeling in a two-stage training process. Experimental results show that HyperLLaVA significantly outperforms LLaVA on existing MLLM benchmarks, including MME, MM-Bench, SEED-Bench, and LLaVA-Bench.

### Main Contributions

1. **Dynamic Tuning Strategy**: The paper investigates dynamic tuning for MLLMs and introduces HyperLLaVA, which optimizes the projector and the LLM using vision- and language-guided dynamic tuning.
2. **Efficient Multi-task Fine-tuning Method**: The proposed visual and language expert modules offer a parameter-efficient multi-task fine-tuning method.
3. **Extensive Experimental Validation**: Comprehensive experiments on multiple MLLM benchmarks demonstrate the effectiveness and generality of the proposed method.
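The idea of a hypernetwork producing input-dependent parameter offsets for an otherwise static projector can be illustrated with a small sketch. This is not the authors' implementation: the class `DynamicProjector`, the single-linear-layer hypernetwork `H`, the dimensions, and the choice of a full (rather than adapter-style) weight offset are all assumptions made purely for illustration, and the weights are random rather than trained.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for trained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class DynamicProjector:
    """Static projector plus a guidance-conditioned offset (hypothetical sketch)."""

    def __init__(self, d_in, d_out, d_guide, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out = d_in, d_out
        # Static weights shared by every input, as in a fixed projector.
        self.W_static = rand_matrix(d_out, d_in, rng)
        # Assumed hypernetwork: one linear layer mapping the guidance
        # vector to d_out * d_in offset values.
        self.H = rand_matrix(d_out * d_in, d_guide, rng, scale=0.01)

    def offset(self, guidance):
        """Generate a weight offset from the guidance vector."""
        flat = matvec(self.H, guidance)
        return [flat[r * self.d_in:(r + 1) * self.d_in] for r in range(self.d_out)]

    def forward(self, x, guidance):
        """Project x with static weights shifted by the generated offset."""
        dW = self.offset(guidance)
        W = [[ws + wd for ws, wd in zip(row_s, row_d)]
             for row_s, row_d in zip(self.W_static, dW)]
        return matvec(W, x)


proj = DynamicProjector(d_in=4, d_out=3, d_guide=2)
x = [1.0, 0.5, -0.5, 2.0]
out_a = proj.forward(x, guidance=[1.0, 0.0])       # one guidance signal
out_b = proj.forward(x, guidance=[0.0, 1.0])       # a different guidance signal
out_static = proj.forward(x, guidance=[0.0, 0.0])  # zero guidance: no offset
```

With zero guidance the offset vanishes and the module reduces to the static projector; different guidance vectors yield different effective weights for the same input `x`, which is the behavior the summary describes as dynamic modeling.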

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation

Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation

Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound