Practical Accuracy Evaluation for Deep Learning Systems Via Latent Representation Discrepancy.

Yining Yin, Yang Feng, Zixi Liu, Zhihong Zhao
DOI: https://doi.org/10.1145/3609437.3609457
2023-01-01
Abstract: Deep learning (DL) systems are now widely deployed in safety-critical scenarios, and their quality and reliability have raised growing concerns. Assuring the quality and evaluating the accuracy of DL models is challenging because, unlike traditional software, DL systems rely on large amounts of labeled data for training and evaluation. DL models also behave differently on datasets with different distributions. In practice, a distribution shift between the training and usage scenarios can degrade model performance and make DL systems more vulnerable. Although several neuron coverage criteria have been proposed to assist in testing DL systems, they are still limited by the amount of labeled data available, and manually labeling test data collected from real-world application scenarios is time-consuming and costly. In this paper, we propose a novel testing metric, LRD, that evaluates the practical accuracy of deep learning systems without requiring ground-truth labels for the test data. The metric uses optimal transport theory to compare model behavior on real-world test data with its behavior on the training set and on out-of-distribution (OOD) sets. To do so, it extracts latent representations from the model as it processes input data and constructs representation patterns based on the training dataset. The paper further introduces two algorithms built on the latent representation: one for OOD data detection and one for LRD-guided test selection for model retraining. Experimental results show that LRD evaluations have a significant positive correlation with the actual accuracy of the model, and that the proposed algorithms are more effective than related OOD detection and test prioritization techniques.
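The abstract does not give the LRD formula, so the following is only an illustrative sketch of the general idea: use an optimal transport distance to place test-set latent representations between those of the training set and an OOD set. It assumes latent vectors have already been extracted as lists of floats, and it substitutes a per-dimension 1D Wasserstein-1 distance for whatever transport formulation the paper uses. The names `wasserstein_1d`, `mean_dimension_distance`, and `relative_discrepancy` are hypothetical and do not come from the paper.

```python
from bisect import bisect_right


def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two 1D empirical samples.

    Computed as the integral of |F_u(x) - F_v(x)| over x, where F_u and F_v
    are the empirical CDFs of the two samples.
    """
    u_sorted, v_sorted = sorted(u), sorted(v)
    all_vals = sorted(u_sorted + v_sorted)
    total = 0.0
    for left, right in zip(all_vals, all_vals[1:]):
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


def mean_dimension_distance(a, b):
    """Average the 1D distance over every latent dimension (an assumption,
    standing in for a full multivariate optimal transport computation)."""
    dims = len(a[0])
    per_dim = (
        wasserstein_1d([row[d] for row in a], [row[d] for row in b])
        for d in range(dims)
    )
    return sum(per_dim) / dims


def relative_discrepancy(train, test, ood):
    """Hypothetical score: 0 when test latents match training latents,
    about 1 when they are as far from training as the OOD reference set."""
    d_test = mean_dimension_distance(test, train)
    d_ood = mean_dimension_distance(ood, train)
    if d_ood == 0:
        raise ValueError("OOD set is indistinguishable from the training set")
    return d_test / d_ood


# Toy latent vectors, invented purely for illustration.
train_latents = [[0.0, 0.0], [1.0, 1.0]]
ood_latents = [[5.0, 5.0], [6.0, 6.0]]
test_latents = [[0.5, 0.5], [1.5, 1.5]]
score = relative_discrepancy(train_latents, test_latents, ood_latents)
```

Under these assumptions, a score near 0 would suggest the test data resembles the training distribution, while a score near 1 would suggest it resembles the OOD reference; no labels for the test data are needed to compute it.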