ENLD: Efficient Noisy Label Detection for Incremental Datasets in Data Lake.

Xuanke You, Lan Zhang, Junyang Wang, Zhimin Bao, Yunfei Wu, Shuaishuai Dong
DOI: https://doi.org/10.1109/icde55515.2023.00151
2023-01-01
Abstract: Because high-quality data is difficult to obtain in real-world scenarios, datasets inevitably contain noisily labeled data, leading to inefficient data usage and poor model performance. Noisy label detection is therefore an important research topic. Previous efforts mainly focus on noisy label detection in specific datasets that have already been collected. Some works select clean samples based on relations between representations during training; others use the confidence outputs of a pre-trained model. However, how to perform efficient and fine-grained noisy label detection on constantly arriving datasets in a data lake with a large amount of inventory data has not been explored. The rapidly growing volume and changing distribution of data cause conventional methods either to incur large computation overhead from repeated training or to become increasingly ineffective on newly arriving data. To address these challenges, we propose ENLD, a novel approach for efficient and accurate noisy label detection on incremental datasets. Our extensive experiments demonstrate that ENLD outperforms the next best method in both efficiency and accuracy, achieving a 3.65×-4.97× detection speedup and higher average F1 scores under various noise rate settings.