Dynamic Data Fault Localization for Deep Neural Networks

Yining Yin, Yang Feng, Shihao Weng, Zixi Liu, Yuan Yao, Yichi Zhang, Zhihong Zhao, Zhenyu Chen
DOI: https://doi.org/10.1145/3611643.3616345
2023-01-01
Abstract: Rich datasets have empowered various deep learning (DL) applications, leading to remarkable success in many fields. However, data faults hidden in these datasets can cause DL applications to behave unpredictably and may even lead to massive financial losses and loss of life. To alleviate this problem, we propose a dynamic data fault localization approach, DFauLo, to locate mislabeled and noisy data in deep learning datasets. DFauLo is inspired by conventional mutation-based code fault localization, but uses the differences between DNN mutants to amplify and identify potential data faults. Specifically, it first generates multiple DNN mutants of the original trained model. It then extracts features from these mutants and maps them to a suspiciousness score indicating the probability that a given data item is a data fault. Moreover, DFauLo is the first dynamic data fault localization technique: it prioritizes suspected data based on user feedback and generalizes to data faults not seen during training. To validate DFauLo, we extensively evaluate it on 26 cases covering various fault types, data types, and model structures. We also evaluate DFauLo on three widely used benchmark datasets. The results show that DFauLo outperforms state-of-the-art techniques in almost all cases and locates hundreds of real data faults of different types in the benchmark datasets.
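To make the pipeline in the abstract concrete, the following is a minimal illustrative sketch, not DFauLo's actual implementation. It assumes each model (the original plus its mutants) has already produced class probabilities for every sample, uses the probability assigned to the given label as the only feature, and replaces the learned, feedback-updated mapping with a fixed average. The function names and the toy data are hypothetical.

```python
from statistics import mean


def label_probability_features(model_probs, label):
    """For one sample, collect the probability each model assigns to the given label.

    model_probs: list of per-model probability vectors (original model and mutants).
    label: index of the label recorded in the dataset for this sample.
    """
    return [probs[label] for probs in model_probs]


def suspiciousness(model_probs, label):
    """Hypothetical score in [0, 1]: one minus the mean label probability.

    Assumption: a fixed average stands in for the learned mapping that the
    abstract describes; a high score means the models disagree with the label.
    """
    return 1.0 - mean(label_probability_features(model_probs, label))


def rank_by_suspiciousness(dataset):
    """Return sample indices ordered from most to least suspicious.

    dataset: list of (model_probs, label) pairs.
    """
    scores = [suspiciousness(probs, label) for probs, label in dataset]
    return sorted(range(len(dataset)), key=lambda i: scores[i], reverse=True)


# Toy data: 3 samples, 2 models (original + one mutant), 2 classes.
toy = [
    ([[0.9, 0.1], [0.8, 0.2]], 0),  # both models agree with label 0
    ([[0.2, 0.8], [0.1, 0.9]], 0),  # both models contradict label 0
    ([[0.4, 0.6], [0.6, 0.4]], 1),  # models are split on label 1
]
ranking = rank_by_suspiciousness(toy)
```

In this toy setting the sample whose label both models contradict is ranked first for review. A dynamic variant in the spirit of the abstract would re-fit the scoring function as a reviewer confirms or rejects the top-ranked items, which this sketch does not attempt.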