DsDm: Model-Aware Dataset Selection with Datamodels

Logan Engstrom, Axel Feldmann, Aleksander Madry
2024-01-24
Abstract: When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior. However, in practice the opposite can often happen: we find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to select training data for large-scale machine learning models. The standard approach is to filter for examples that match human notions of data quality, but the authors find that selecting data by similarity to "high-quality" sources may not improve model performance and can even hurt it relative to random selection. They instead frame dataset selection as a direct optimization problem: find the training subset that maximizes model performance on a set of target tasks, rather than relying on subjective quality criteria. Their method, DsDm (Dataset Selection with Datamodels), uses datamodels to estimate how the learning algorithm uses training data to make predictions, then selects the subset that maximizes the estimated performance.

DsDm performs well on a range of language modeling tasks, including cases where existing selection methods fail. When the target tasks are chosen to be representative of standard language modeling problems, DsDm improves compute efficiency by a factor of two on a range of held-out benchmarks. The experiments show gains on both the specified target tasks and unseen tasks, outperforming standard selection methods based on text similarity. The gains are largest on benchmarks related to the target tasks, such as reading comprehension and world knowledge, while performance in other categories remains stable. The paper also stresses that target tasks should be diverse enough to cover different downstream problems, and argues that a model-aware method like DsDm is needed to turn a chosen set of target tasks into improved model behavior.
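The selection step described above can be illustrated with a small sketch. It assumes a linear datamodel, in which the loss on each target example is approximated as a bias plus a sum of per-candidate weights for the training examples included. Under that assumption, maximizing estimated performance reduces to keeping the candidates with the lowest average weight. The weight matrix, the bias values, and the function names `select_subset` and `estimated_loss` are hypothetical and chosen for illustration; this is not the authors' implementation, and estimating the weights themselves is the expensive part that the sketch leaves out.

```python
import numpy as np


def select_subset(theta, k):
    """Pick the k candidates with the lowest average estimated effect on target loss.

    theta: array of shape (n_targets, n_candidates). Under the assumed linear
    datamodel, theta[i, j] is the estimated change in loss on target example i
    when candidate training example j is included (hypothetical values here).
    """
    avg_effect = theta.mean(axis=0)
    # Ascending sort: the most negative entries are estimated to reduce loss most.
    return np.argsort(avg_effect, kind="stable")[:k]


def estimated_loss(theta, bias, selected):
    """Average target loss predicted by the linear datamodel for a chosen subset."""
    indicator = np.zeros(theta.shape[1])
    indicator[np.asarray(selected)] = 1.0
    return float((theta @ indicator + bias).mean())


# Toy, made-up weights: 2 target examples, 4 candidate training examples.
theta = np.array([
    [-0.3, 0.1, -0.2, 0.0],
    [-0.1, 0.2, -0.4, 0.1],
])
bias = np.array([2.0, 2.5])

chosen = select_subset(theta, k=2)
print("selected candidates:", sorted(chosen.tolist()))
print("estimated loss (selected):", estimated_loss(theta, bias, chosen))
print("estimated loss (candidates 1 and 3):", estimated_loss(theta, bias, [1, 3]))
```

In this toy setting the selected pair has a lower estimated target loss than the alternative pair, which is the sense in which the subset "maximizes estimated performance". A quality filter based on text similarity would never consult `theta`, so it has no direct signal about what the learning algorithm gains from each example.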