TSDS: Data Selection for Task-Specific Model Finetuning

Zifan Liu, Amin Karbasi, Theodoros Rekatsinas

2024-10-23

Abstract: Finetuning foundation models for specific tasks is an emerging paradigm in modern machine learning. The efficacy of task-specific finetuning largely depends on the selection of appropriate training data. We present TSDS (Task-Specific Data Selection), a framework to select data for task-specific model finetuning, guided by a small but representative set of examples from the target task. To do so, we formulate data selection for task-specific finetuning as an optimization problem with a distribution alignment loss based on optimal transport to capture the discrepancy between the selected data and the target distribution. In addition, we add a regularizer to encourage the diversity of the selected data and incorporate kernel density estimation into the regularizer to reduce the negative effects of near-duplicates among the candidate data. We connect our optimization problem to nearest neighbor search and design efficient algorithms to compute the optimal solution based on approximate nearest neighbor search techniques. We evaluate our method on data selection for both continued pretraining and instruction tuning of language models. We show that instruction tuning using data selected by our method with a 1% selection ratio often outperforms using the full dataset and beats the baseline selection methods by 1.5 points in F1 score on average.

Machine Learning, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

The paper addresses how to select appropriate training data for finetuning a model on a specific task. It proposes TSDS (Task-Specific Data Selection), a framework that selects a finetuning set from a large pool of candidate data, guided by a small set of representative examples from the target task. TSDS formulates data selection as an optimization problem whose objective is a distribution alignment loss based on optimal transport, plus a regularizer that encourages diversity in the selected data. Kernel density estimation is incorporated into the regularizer to reduce the negative effect of near-duplicates among the candidates. The paper connects this optimization problem to nearest neighbor search and designs efficient algorithms that compute the optimal solution using approximate nearest neighbor search techniques. The method is evaluated on instruction tuning and domain-specific continued pretraining of language models. According to the abstract, instruction tuning on data selected at a 1% selection ratio often outperforms using the full dataset and beats baseline selection methods by 1.5 points in F1 score on average.
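To make the idea concrete, here is a minimal Python sketch of selection driven by nearest neighbors with a density-based correction for near-duplicates. It is only loosely inspired by the description above and is not the paper's algorithm: the function name `select_candidates`, the parameters `k`, `bandwidth`, and `n_select`, the exact (brute-force) neighbor search, the Gaussian kernel, the inverse-density weighting, and sampling with replacement are all assumptions made for illustration.

```python
import numpy as np

def select_candidates(queries, candidates, k=3, bandwidth=0.5, n_select=4, seed=0):
    """Toy selection: spread each query's mass over its k nearest candidates,
    down-weighting candidates that sit in dense (near-duplicate) regions.
    Illustrative only; not the TSDS algorithm."""
    # Squared Euclidean distances between queries and candidates, shape (m, n).
    d2 = ((queries[:, None, :] - candidates[None, :, :]) ** 2).sum(-1)
    # Gaussian kernel density estimate over the candidates themselves:
    # a candidate with many near-duplicates gets a high density value.
    cc = ((candidates[:, None, :] - candidates[None, :, :]) ** 2).sum(-1)
    density = np.exp(-cc / (2 * bandwidth ** 2)).sum(axis=1)

    probs = np.zeros(len(candidates))
    for row in d2:
        nn = np.argsort(row)[:k]            # exact k-NN (brute force)
        w = 1.0 / density[nn]               # inverse-density weights (assumption)
        probs[nn] += w / w.sum() / len(queries)

    # Sample the selected set from the resulting distribution (with replacement).
    rng = np.random.default_rng(seed)
    chosen = rng.choice(len(candidates), size=n_select, replace=True, p=probs)
    return probs, chosen

# One target example, three exact duplicates, one unique neighbor, one far outlier.
queries = np.array([[0.0, 0.0]])
candidates = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [10.0, 10.0]])
probs, chosen = select_candidates(queries, candidates, k=4)
print(probs)   # the three duplicates share mass; the outlier gets none
print(chosen)
```

In this toy example the three duplicated candidates split roughly the same total mass that the single unique neighbor receives, so duplicates do not dominate the selection. In a large-scale setting, the brute-force distance computation would be replaced by an approximate nearest neighbor index.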

Reinforced training data selection for domain adaptation

Rethinking Data Selection at Scale: Random Selection is Almost All You Need

Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

Rethinking Data Selection for Supervised Fine-Tuning

Compute-Constrained Data Selection

Data-Efficient Finetuning Using Cross-Task Nearest Neighbors

Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness

What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning

Towards Accelerated Model Training via Bayesian Data Selection

SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models

Scalable Fine-tuning from Multiple Data Sources: A First-Order Approximation Approach

Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models

Transductive Data-Selection Algorithms for Fine-Tuning Neural Machine Translation

Two-Stage Fine-Tuning: A Novel Strategy for Learning Class-Imbalanced Data

LESS: Selecting Influential Data for Targeted Instruction Tuning

Model Balancing Helps Low-data Training and Fine-tuning

IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection

MoDS: Model-oriented Data Selection for Instruction Tuning