Find Important Training Dataset by Observing the Training Sequence Similarity

Zhengchang Liu, Hang Diao, Fan Zhang, Samee U. Khan
DOI: https://doi.org/10.1007/978-3-031-44213-1_34
2023-01-01
Abstract: It is imperative to eliminate training data that has minimal impact on model accuracy. In addition to eliminating training data that share similar features, we propose a novel concept called the training sequence, which records the trajectory of each training sample as a correct or incorrect prediction in each training epoch. We eliminate training data that exhibit similar training trajectories. We complement this approach with the identification of hard-to-forget training data, that is, data that are consistently predicted correctly. We conducted extensive experiments on various classical classification tasks and compared our approach with the forgetting-score method. Our experimental findings demonstrate that our approach outperforms the forgetting-score approach by up to 13.2% and is particularly effective at low training data retention ratios, implying that our method can choose important training datasets with satisfactory performance. Our open-source code is available at the following link: https://github.com/sheldonlll/angle_method.
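The training-sequence idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the cosine similarity measure, the greedy keep/drop rule, the `threshold` parameter, and the function names `cosine_similarity_matrix` and `select_by_sequence` are all assumptions made for illustration. Each row is one training sample, each column is one epoch, and an entry is 1 if the sample was predicted correctly in that epoch.

```python
import numpy as np


def cosine_similarity_matrix(seqs):
    """Pairwise cosine similarity between rows of a 0/1 correctness matrix."""
    seqs = np.asarray(seqs, dtype=float)
    norms = np.linalg.norm(seqs, axis=1, keepdims=True)
    # Samples that were never predicted correctly have a zero norm;
    # use 1.0 there so the division is defined (their similarity stays 0).
    norms[norms == 0] = 1.0
    unit = seqs / norms
    return unit @ unit.T


def select_by_sequence(seqs, threshold=0.9):
    """Greedily keep a sample only if its training sequence is less similar
    than `threshold` to every sample already kept (hypothetical rule)."""
    sim = cosine_similarity_matrix(seqs)
    kept = []
    for i in range(sim.shape[0]):
        if all(sim[i, j] < threshold for j in kept):
            kept.append(i)
    return kept


# Toy training sequences: 4 samples observed over 5 epochs.
sequences = np.array([
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],  # same trajectory as sample 0, so it is redundant
    [1, 0, 1, 0, 1],  # learned, forgotten, and relearned
    [1, 1, 1, 1, 1],  # always predicted correctly
])

kept = select_by_sequence(sequences, threshold=0.9)
print(kept)  # [0, 2, 3]
```

Lowering `threshold` makes the selection more aggressive, which corresponds to a lower data retention ratio; in this toy example a threshold of 0.7 also drops sample 3.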