Abstract: The labor-intensive annotation process of semantic segmentation datasets is often prone to errors, since humans struggle to label every pixel correctly. We study algorithms to automatically detect such annotation errors, in particular methods to score label quality, such that the images with the lowest scores are least likely to be correctly labeled. This helps prioritize what data to review in order to ensure a high-quality training/evaluation dataset, which is critical in sensitive applications such as medical imaging and autonomous vehicles. Widely applicable, our label quality scores rely on probabilistic predictions from a trained segmentation model -- any model architecture and training procedure can be utilized. Here we study 7 different label quality scoring methods used in conjunction with a DeepLabV3+ or an FPN segmentation model to detect annotation errors in a version of the SYNTHIA dataset. Precision-recall evaluations reveal a score -- the soft-minimum of the model-estimated likelihoods of each pixel's annotated class -- that is particularly effective at identifying mislabeled images, across multiple types of annotation error.
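
The soft-minimum score described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `softmin_label_quality`, the array layout (classes first), and the temperature value of 0.1 are all assumptions made for illustration. The sketch takes the model's predicted probability for the annotated class at each pixel and averages those probabilities with softmax weights that favor the smallest values, so that a few badly explained pixels pull the image score down.

```python
import numpy as np

def softmin_label_quality(pred_probs, labels, temperature=0.1):
    """Soft-minimum label quality score for one image (illustrative sketch).

    pred_probs : float array of shape (K, H, W), per-pixel class probabilities
                 from any trained segmentation model (layout is an assumption).
    labels     : int array of shape (H, W), the annotated class of each pixel.
    temperature: assumed smoothing parameter; smaller values approach the
                 hard minimum, larger values approach the plain mean.
    """
    # Probability the model assigns to the annotated class at every pixel.
    p = np.take_along_axis(pred_probs, labels[None, ...], axis=0)[0].ravel()

    # Softmax over -p / temperature: low-probability pixels get high weight.
    z = -p / temperature
    z = z - z.max()          # shift for numerical stability
    weights = np.exp(z)
    weights = weights / weights.sum()

    # Weighted average of the per-pixel probabilities (a smooth minimum).
    return float(np.dot(weights, p))
```

Under these assumptions, an image whose annotated classes all receive high predicted probability scores close to that probability, while an image containing even a small region of poorly explained pixels scores close to the lowest per-pixel probability. Images would then be ranked by this score, with the lowest-scoring ones reviewed first.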

What problem does this paper attempt to address?

The paper addresses data quality problems in semantic segmentation datasets caused by errors made during manual annotation. Specifically, it investigates algorithms that automatically detect these annotation errors, in particular methods for scoring label quality, so that the images most likely to be mislabeled can be reviewed first. This helps ensure that the training/evaluation datasets used in sensitive applications (such as medical imaging and autonomous driving) are of high quality.

### Background and Problem

- **Background**: Semantic segmentation requires classifying every pixel in an image, making it a fine-grained image understanding task. Image datasets have grown significantly in recent years to support training more effective segmentation models, especially in fields such as radiology, pathology, robotics, and autonomous driving.
- **Problem**: Annotating semantic segmentation data requires pixel-by-pixel labeling of images, which is highly labor-intensive and error-prone. Training a model on incorrect labels is problematic, and even evaluating a model on noisy labels raises concerns in high-risk applications.

### Research Objectives

- **Primary Objective**: Develop a general method to assess label quality in semantic segmentation datasets and detect annotation errors.
- **Specific Objectives**:
  - Develop a label quality scoring method that can be applied to any segmentation model.
  - Experimentally validate the effectiveness of different scoring methods, particularly their performance under different types of annotation errors (omission, swap, shift).
  - Provide an efficient and accurate label quality scoring method to help prioritize the review and correction of incorrectly labeled images.

### Methods and Experiments

- **Methods**: The paper investigates 7 different label quality scoring methods, including the Softmin score, which is computed from the model's predicted probabilities for each pixel's annotated class.
- **Experiments**: Experiments were conducted with two segmentation models, DeepLabV3+ and FPN, on different versions of the SYNTHIA dataset. Three common types of annotation errors (omission, swap, shift) were introduced, and the scoring methods were compared using precision-recall (PR) evaluations.

### Results and Conclusions

- **Results**: The Softmin score is particularly effective at identifying mislabeled images across the different types of annotation errors, as measured by precision and recall.
- **Conclusions**: The Softmin score is a general and effective method that can be used with any segmentation model to help detect and correct annotation errors in semantic segmentation datasets, thereby improving dataset quality.

### Application Prospects

- **Practical Applications**: The method can help researchers and engineers ensure high-quality training and evaluation datasets in high-risk fields such as medical imaging and autonomous driving, thereby improving the reliability and accuracy of models.
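
The precision-recall evaluation mentioned above can be sketched as follows. This is a simplified illustration, not the paper's evaluation code: the function name `precision_recall_at_k`, the choice to flag a fixed number `k` of lowest-scoring images, and the example values are assumptions. It assumes each image has a label quality score and a known ground-truth flag saying whether an annotation error was introduced into it.

```python
import numpy as np

def precision_recall_at_k(scores, is_mislabeled, k):
    """Precision and recall when flagging the k lowest-scoring images.

    scores        : float array of shape (N,), one label quality score per image
                    (lower means more likely mislabeled).
    is_mislabeled : bool array of shape (N,), ground-truth error flags.
    k             : number of lowest-scoring images to flag for review.
    Assumes k >= 1 and at least one image is truly mislabeled.
    """
    scores = np.asarray(scores, dtype=float)
    is_mislabeled = np.asarray(is_mislabeled, dtype=bool)

    # Indices of the k images with the lowest label quality scores.
    flagged = np.argsort(scores)[:k]

    # How many of the flagged images really contain an annotation error.
    hits = int(is_mislabeled[flagged].sum())

    precision = hits / k
    recall = hits / int(is_mislabeled.sum())
    return precision, recall

# Hypothetical example: five images, three of which contain errors.
example_scores = [0.9, 0.1, 0.8, 0.2, 0.7]
example_flags = [False, True, False, True, True]
precision, recall = precision_recall_at_k(example_scores, example_flags, k=2)
```

Sweeping `k` from 1 to the number of images would trace out a precision-recall curve for one scoring method, which allows different scoring methods to be compared on the same set of introduced errors.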

Estimating label quality and errors in semantic segmentation data via any model

Troubleshooting image segmentation models with human-in-the-loop

Automated Detection of Label Errors in Semantic Segmentation Datasets via Deep Learning and Uncertainty Quantification

Quality Sentinel: Estimating Label Quality and Errors in Medical Segmentation Datasets

A Novel Quality Evaluating Method for Over-Segmentation Approaches Using Real-Time Boundary Information

Automatic Discovery and Assessment of Interpretable Systematic Errors in Semantic Segmentation

How to Efficiently Annotate Images for Best-Performing Deep Learning Based Segmentation Models: An Empirical Study with Weak and Noisy Annotations and Segment Anything Model

Diagnostics in Semantic Segmentation

Pick-and-Learn: Automatic Quality Evaluation for Noisy-Labeled Image Segmentation

Semantic Segmentation of Airborne LiDAR Point Clouds With Noisy Labels

Identifying Label Errors in Object Detection Datasets by Loss Inspection

Label Smarter, Not Harder: CleverLabel for Faster Annotation of Ambiguous Image Classification with Higher Quality

Learning to Segment from Noisy Annotations: A Spatial Correction Approach

Semantic Segmentation of Weakly Annotated Remote Sensing Images Based on Feature Adversary and Uncertainty Perception

An Empirical Study of Automated Mislabel Detection in Real World Vision Datasets

Trusting Semantic Segmentation Networks

A Weakly-Supervised Semantic Segmentation Approach Based on the Centroid Loss: Application to Quality Control and Inspection

Using Unreliable Pseudo-Labels for Label-Efficient Semantic Segmentation

ErrorAug: Making Errors to Find Errors in Semantic Segmentation

SAM Carries the Burden: A Semi-Supervised Approach Refining Pseudo Labels for Medical Segmentation

Push the Boundary of SAM: A Pseudo-label Correction Framework for Medical Segmentation