Mining Influential Training Data by Tracing Influence on Hard Validation Samples

Qikai Zhang, Fan Zhang, Samee U. Khan
DOI: https://doi.org/10.1109/ICTAI56018.2022.00032
2022-01-01
Abstract: The ever-growing deep learning model size is constantly driven by the ever-growing dataset size. Mining the influential training data has a significant payoff: it can reduce training time and model complexity, and potentially increase model accuracy. In this paper, we propose several approaches to co-prune the validation dataset and the training dataset, e.g., classifying the validation dataset into easy, medium, and hard levels, and introducing an influence value computed for each training sample on the hard validation data. Empirically, we conclude that the hard portion of the validation data can be used to mine the most influential training data, thereby reducing the training dataset size by 50% without losing accuracy in our experiments.
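The pipeline described in the abstract (split the validation set by difficulty, score each training sample by its influence on the hard validation samples, then keep the top half) can be illustrated with a minimal sketch. The abstract does not state how difficulty or influence is computed, so everything below is an assumption made for illustration: difficulty is taken as per-sample validation loss split at terciles, influence is a first-order gradient dot product in the style of TracIn evaluated at a single checkpoint, and the model is a small NumPy logistic regression on synthetic data. All function names are hypothetical and are not taken from the paper.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def per_sample_loss(w, X, y):
    # Binary cross-entropy for each sample (probabilities clipped for stability).
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))


def per_sample_grad(w, X, y):
    # Gradient of the logistic loss w.r.t. w for each sample: (p - y) * x.
    return (sigmoid(X @ w) - y)[:, None] * X


def fit_logreg(X, y, lr=0.5, steps=300):
    # Plain full-batch gradient descent; stands in for the trained model.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * per_sample_grad(w, X, y).mean(axis=0)
    return w


def split_by_difficulty(losses):
    # ASSUMPTION: difficulty = validation loss, split at terciles.
    # Returns 0 (easy), 1 (medium), or 2 (hard) per sample.
    lo, hi = np.quantile(losses, [1 / 3, 2 / 3])
    return np.where(losses <= lo, 0, np.where(losses <= hi, 1, 2))


def influence_on_hard(w, X_tr, y_tr, X_val, y_val):
    # ASSUMPTION: influence = dot product between a training sample's gradient
    # and the summed gradient of the hard validation samples. A positive score
    # means a gradient step on that training sample lowers the hard-set loss.
    levels = split_by_difficulty(per_sample_loss(w, X_val, y_val))
    hard = levels == 2
    g_hard = per_sample_grad(w, X_val[hard], y_val[hard]).sum(axis=0)
    return per_sample_grad(w, X_tr, y_tr) @ g_hard


def prune_training_set(scores, keep_fraction=0.5):
    # Keep the indices of the highest-scoring training samples.
    k = max(1, int(round(keep_fraction * len(scores))))
    return np.sort(np.argsort(scores)[::-1][:k])


# Synthetic demo data (not from the paper).
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])
X_train = rng.normal(size=(200, 3))
y_train = (X_train @ true_w + rng.normal(scale=0.5, size=200) > 0).astype(float)
X_val = rng.normal(size=(90, 3))
y_val = (X_val @ true_w + rng.normal(scale=0.5, size=90) > 0).astype(float)

w = fit_logreg(X_train, y_train)
scores = influence_on_hard(w, X_train, y_train, X_val, y_val)
kept = prune_training_set(scores, keep_fraction=0.5)
print(f"kept {len(kept)} of {len(X_train)} training samples")
```

To check a result like the one reported in the abstract, one would retrain on `X_train[kept]` and compare validation accuracy against the full-data model. The sketch does not perform that comparison and makes no claim about its outcome.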