Abstract: Visual instruction tuning is the key to building large vision language models (LVLMs), which can greatly improve task generalization and solving capabilities by learning from a mixture of instruction data from diverse visual tasks. Previous work mostly collects multiple existing visual instruction datasets heuristically for training (sometimes more than a million instructions), which may introduce data redundancy and increase the training cost. To investigate this issue, we conduct a series of empirical studies, which reveal significant redundancy within visual instruction datasets and show that greatly reducing the number of instructions from several tasks does not even affect performance. Based on these findings, we propose a high-value data selection approach, TIVE, to eliminate redundancy within the visual instruction data and reduce the training cost. In TIVE, we first estimate the influence score of each instance on its corresponding task, and the difficulty score of each task, using gradient-based influence functions. Then, we use the two kinds of scores to determine the task proportions within the selected visual instruction subset and to select high-value instances for each task, respectively. Experiments on various LVLMs show that our approach, using only about 15% of the data, can achieve average performance comparable to the full-data fine-tuned model across eight benchmarks, even surpassing it on four of them. Our code and data will be publicly released.

What problem does this paper attempt to address?

### The Problem Addressed by the Paper

This paper aims to address the issue of data redundancy faced by large-scale vision-language models (LVLMs) during visual instruction tuning. Specifically, existing LVLMs are typically trained by collecting multiple existing visual instruction datasets, which may contain millions of instructions. This approach not only increases training costs but may also lead to data redundancy, thereby affecting the model's performance.

To explore this issue, the authors conducted a series of empirical studies and found significant redundancy in existing visual instruction datasets. Further research indicated that even significantly reducing the number of instructions for certain tasks does not substantially impact the model's performance. Based on these findings, the authors proposed a high-value data selection method, TIVE (Task and Instance Value Estimation), to eliminate data redundancy and reduce training costs.

### Method Overview

The core of the TIVE method lies in estimating the impact of each instance on its task and the difficulty of the task, and selecting a high-value data subset based on these estimates. The specific steps are as follows:

1. **Task Difficulty Estimation**: Measure the difficulty of a task by calculating the self-influence score of all instances within the task. The self-influence score reflects the impact of training a particular instance on its own learning. The difficulty score of a task is the average of the self-influence scores of all its instances.
2. **Instance Influence Estimation**: Measure the value of an instance by calculating its influence score on other instances within its task. The instance influence score is the average gradient similarity of the instance to other instances in the task.
3. **Data Subset Selection**: Determine the proportion of each task in the final data subset based on task difficulty and instance influence scores, and select high-value instances from each task. Specifically, the proportion of a task is determined by its difficulty score, while the selection of instances is based on the softmax distribution of instance influence scores.

### Experimental Results

The authors conducted extensive experiments on multiple LVLMs and different benchmarks to validate the effectiveness of the TIVE method. The experimental results showed that using a data subset selected by the TIVE method, which accounts for only 15% of the total data, can achieve comparable average performance to models fine-tuned with the full dataset, and even surpass the full-dataset fine-tuned models on four benchmarks. This indicates that the TIVE method can effectively reduce data redundancy, improve training efficiency, and maintain or enhance model performance.

### Conclusion

This paper reveals the redundancy issue in existing visual instruction datasets through empirical studies and proposes an efficient data selection method, TIVE, to reduce data redundancy and optimize the training process of LVLMs. The experimental results demonstrate that the TIVE method performs excellently across multiple models and datasets, showing broad application prospects.
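As a rough illustration of the three steps in the Method Overview, the sketch below works on precomputed per-instance gradient features (one NumPy matrix per task). It is not the authors' implementation. Several details are assumptions made only for this example: self-influence is taken as the squared gradient norm, gradient similarity as cosine similarity, task proportions as difficulty scores normalized to sum to one, and the `temperature` parameter and all function names are hypothetical.

```python
import numpy as np


def task_difficulty(grads: np.ndarray) -> float:
    """Mean self-influence; self-influence is ASSUMED to be the squared gradient norm."""
    return float(np.mean(np.sum(grads * grads, axis=1)))


def instance_influence(grads: np.ndarray) -> np.ndarray:
    """Average similarity of each instance's gradient to the others in its task.

    Cosine similarity is an ASSUMPTION; the summary only says "gradient similarity".
    """
    n = grads.shape[0]
    if n < 2:
        return np.zeros(n)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    unit = grads / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    # Drop the self-similarity on the diagonal before averaging.
    return (sim.sum(axis=1) - np.diag(sim)) / (n - 1)


def softmax(x: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Numerically stable softmax; `temperature` is a hypothetical knob."""
    z = (x - x.max()) / temperature
    e = np.exp(z)
    return e / e.sum()


def select_subset(task_grads, budget, temperature=1.0, seed=0):
    """Return {task name: selected instance indices} for a total budget.

    Each task receives a share of the budget proportional to its difficulty
    (ASSUMED normalization); instances are then sampled without replacement
    from the softmax distribution over their influence scores.
    Assumes every task is non-empty and has a non-zero gradient somewhere.
    """
    rng = np.random.default_rng(seed)
    names = list(task_grads)
    difficulty = np.array([task_difficulty(task_grads[t]) for t in names])
    proportions = difficulty / difficulty.sum()
    selected = {}
    for name, share in zip(names, proportions):
        grads = task_grads[name]
        k = min(len(grads), int(round(budget * share)))
        probs = softmax(instance_influence(grads), temperature)
        selected[name] = rng.choice(len(grads), size=k, replace=False, p=probs)
    return selected
```

A call such as `select_subset({"vqa": g_vqa, "caption": g_cap}, budget=1000)` (with hypothetical gradient matrices `g_vqa` and `g_cap`) would give tasks with larger average self-influence a larger share of the 1000 selected instances. Sampling from the softmax distribution, rather than taking the top-scoring instances, keeps some lower-influence instances in the subset.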

Less is More: High-value Data Selection for Visual Instruction Tuning

Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection

LESS: Selecting Influential Data for Targeted Instruction Tuning

MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning

Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey

What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness

Vision-Language Instruction Tuning: A Review and Analysis

Instruction Matters: A Simple yet Effective Task Selection for Optimized Instruction Tuning of Specific Tasks

A Survey on Data Selection for LLM Instruction Tuning

IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection

SVIT: Scaling up Visual Instruction Tuning

Rethinking Overlooked Aspects in Vision-Language Models

Concept-skill Transferability-based Data Selection for Large Vision-Language Models

Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models

Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning

Aligning Large Multi-Modal Model with Robust Instruction Tuning

Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning