TSDS: Data Selection for Task-Specific Model Finetuning

Zifan Liu,Amin Karbasi,Theodoros Rekatsinas
2024-10-23
Abstract:Finetuning foundation models for specific tasks is an emerging paradigm in modern machine learning. The efficacy of task-specific finetuning largely depends on the selection of appropriate training data. We present TSDS (Task-Specific Data Selection), a framework to select data for task-specific model finetuning, guided by a small but representative set of examples from the target task. To do so, we formulate data selection for task-specific finetuning as an optimization problem with a distribution alignment loss based on optimal transport to capture the discrepancy between the selected data and the target distribution. In addition, we add a regularizer to encourage the diversity of the selected data and incorporate kernel density estimation into the regularizer to reduce the negative effects of near-duplicates among the candidate data. We connect our optimization problem to nearest neighbor search and design efficient algorithms to compute the optimal solution based on approximate nearest neighbor search techniques. We evaluate our method on data selection for both continued pretraining and instruction tuning of language models. We show that instruction tuning using data selected by our method with a 1% selection ratio often outperforms using the full dataset and beats the baseline selection methods by 1.5 points in F1 score on average.
Machine Learning,Artificial Intelligence,Computation and Language
What problem does this paper attempt to address?
The problem that this paper attempts to solve is how to select appropriate training data in the fine - tuning of specific - task models. Specifically, the paper proposes a framework named TSDS (Task - Specific Data Selection), which aims to select a data set suitable for fine - tuning specific tasks from a large amount of candidate data. This process is guided by a small number of representative target - task examples. TSDS is achieved by formulating the data - selection problem as an optimization problem, which takes the distribution - alignment loss based on optimal transport as the objective function, while introducing a regularization term to encourage the diversity of the selected data, and reducing the negative impact of near - duplicates in the candidate data through kernel - density estimation. In addition, the paper also designs efficient algorithms to calculate the optimal solution of the optimization problem, which are based on approximate nearest - neighbor search techniques. The paper evaluates the proposed method on the instruction - tuning of language models and domain - specific continuous pre - training tasks, demonstrating its advantages over existing methods.