Estimating label quality and errors in semantic segmentation data via any model

Vedang Lad, Jonas Mueller
2023-07-11
Abstract: The labor-intensive annotation process of semantic segmentation datasets is often prone to errors, since humans struggle to label every pixel correctly. We study algorithms to automatically detect such annotation errors, in particular methods to score label quality, such that the images with the lowest scores are least likely to be correctly labeled. This helps prioritize what data to review in order to ensure a high-quality training/evaluation dataset, which is critical in sensitive applications such as medical imaging and autonomous vehicles. Widely applicable, our label quality scores rely on probabilistic predictions from a trained segmentation model -- any model architecture and training procedure can be utilized. Here we study 7 different label quality scoring methods used in conjunction with a DeepLabV3+ or an FPN segmentation model to detect annotation errors in a version of the SYNTHIA dataset. Precision-recall evaluations reveal a score -- the soft-minimum of the model-estimated likelihoods of each pixel's annotated class -- that is particularly effective at identifying mislabeled images, across multiple types of annotation error.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses data quality problems in semantic segmentation datasets that arise from errors made during manual annotation. Specifically, it investigates algorithms that automatically detect these annotation errors, in particular methods that score label quality so that the images most likely to be incorrectly labeled can be reviewed first. This helps ensure that the training and evaluation datasets used in sensitive applications (such as medical imaging and autonomous driving) are of high quality.

### Background and Problem

- **Background**: Semantic segmentation requires classifying every pixel in an image, which makes it a fine-grained image understanding task. In recent years, image datasets have grown significantly in order to train more effective segmentation models, especially in fields such as radiology, pathology, robotics, and autonomous driving.
- **Problem**: Annotating semantic segmentation data requires labeling images pixel by pixel, a highly labor-intensive and error-prone process. Training a model on incorrect labels is clearly problematic, and even noisy labels used only for model evaluation raise concerns in high-risk applications.

### Research Objectives

- **Primary Objective**: Develop a general method to assess label quality in semantic segmentation datasets and detect annotation errors.
- **Specific Objectives**:
  - Develop a label quality scoring method that can be applied to any segmentation model.
  - Experimentally validate the effectiveness of different scoring methods, particularly their performance under different types of annotation error (omission, swap, shift).
  - Provide an efficient and accurate label quality scoring method to help prioritize the review and correction of incorrectly labeled images.
### Methods and Experiments

- **Methods**: The paper investigates 7 different label quality scoring methods, all computed from a trained segmentation model's predicted probabilities. Among them is the Softmin score: the soft-minimum of the model-estimated likelihoods of each pixel's annotated class.
- **Experiments**: Experiments were conducted with two segmentation models, DeepLabV3+ and FPN, on a version of the SYNTHIA dataset into which three common types of annotation error (omission, swap, shift) were introduced. The scoring methods were compared using precision-recall (PR) evaluations.

### Results and Conclusions

- **Results**: The Softmin score was the most effective of the methods studied at identifying mislabeled images, and it remained effective across the different types of annotation error.
- **Conclusions**: The Softmin score is a general and effective method that can be applied to any segmentation model to help detect and correct annotation errors in semantic segmentation datasets, thereby improving dataset quality.

### Application Prospects

- **Practical Applications**: The method can help researchers and engineers ensure that training and evaluation datasets are of high quality in high-risk fields such as medical imaging and autonomous driving, which in turn improves the reliability and accuracy of models.
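
The Softmin score described under Methods can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function name `softmin_label_quality`, the array layout, and the temperature value of 0.1 are assumptions made for illustration. It treats the soft-minimum as a weighted average of the per-pixel probabilities of the annotated class, with weights given by a softmax over the negated probabilities, so that the least-supported pixels dominate the image's score.

```python
import numpy as np


def softmin_label_quality(pred_probs, labels, temperature=0.1):
    """Score one image's annotation; lower means more likely mislabeled.

    pred_probs: float array of shape (K, H, W) with per-class probabilities.
    labels:     int array of shape (H, W) with the annotated class per pixel.
    temperature: assumed smoothing parameter (hypothetical default);
                 smaller values bring the score closer to the hard minimum.
    """
    # Probability the model assigns to each pixel's annotated class.
    p = np.take_along_axis(pred_probs, labels[None, :, :], axis=0)[0].ravel()

    # Softmax over negated probabilities: low-probability pixels get
    # the largest weights. Subtract the max for numerical stability.
    logits = -p / temperature
    logits -= logits.max()
    weights = np.exp(logits)
    weights /= weights.sum()

    # Weighted average of the per-pixel probabilities (the soft minimum).
    return float(np.dot(weights, p))


# Toy 2x2 image with 2 classes: the model predicts class 0 everywhere
# with probability 0.9.
pred_probs = np.stack([np.full((2, 2), 0.9), np.full((2, 2), 0.1)])

clean_labels = np.zeros((2, 2), dtype=int)   # annotation agrees with the model
noisy_labels = clean_labels.copy()
noisy_labels[0, 0] = 1                       # one pixel annotated as class 1

clean_score = softmin_label_quality(pred_probs, clean_labels)
noisy_score = softmin_label_quality(pred_probs, noisy_labels)
print(clean_score, noisy_score)
```

In this toy case a single disagreeing pixel pulls the image's score far below that of the clean annotation, which a plain mean over pixels would not do. Ranking images by ascending score then gives a review order in which the images most likely to contain annotation errors come first.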