Abstract: Active learning is a commonly used approach that reduces the labeling effort required to train deep neural networks. However, the effectiveness of current active learning methods is limited by their closed-world assumptions, which assume that all data in the unlabeled pool comes from a set of predefined known classes. This assumption is often not valid in practical situations, as there may be unknown classes in the unlabeled data, leading to the active open-set annotation problem. The presence of unknown classes in the data can significantly impact the performance of existing active learning methods due to the uncertainty they introduce. To address this issue, we propose a novel data-centric active learning method called NEAT that actively annotates open-set data. NEAT is designed to label known classes data from a pool of both known and unknown classes unlabeled data. It utilizes the clusterability of labels to identify the known classes from the unlabeled pool and selects informative samples from those classes based on a consistency criterion that measures inconsistencies between model predictions and local feature distribution. Unlike the recently proposed learning-centric method for the same problem, NEAT is much more computationally efficient and is a data-centric active open-set annotation method. Our experiments demonstrate that NEAT achieves significantly better performance than state-of-the-art active learning methods for active open-set annotation.
What problem does this paper attempt to address?
This paper addresses active learning in the open world: how to effectively label samples of known categories when the unlabeled data also contains unknown categories. Traditional active learning methods usually assume that all unlabeled data belong to known categories, which does not always hold in practical applications. When the data set contains unknown categories, these methods may mislabel samples from unknown categories, which degrades the performance of the model.
Specifically, the paper introduces a new method named NEAT, a data-centric, inconsistency-based active learning method for annotating open-set data. NEAT distinguishes known from unknown categories using a clustering criterion and a consistency measure, which avoids mislabeling samples of unknown categories. Compared with the existing learning-centric method, NEAT is more computationally efficient and also performs well at selecting informative samples of known categories.
### Specific problems solved by the paper:
1. **Active learning in the open world**: Traditional active learning methods work well under the closed-world assumption, that is, when all unlabeled data belong to known categories. In the open world, however, unlabeled data may contain unknown categories, which causes traditional methods to fail.
2. **Identification of unknown categories**: Existing active learning methods perform poorly when unknown categories are present and may mislabel samples of unknown categories, which hurts model training.
3. **Computational efficiency**: Existing learning-centric methods need to train an additional detection network to distinguish between known and unknown categories, which increases the computational cost. NEAT, being data-centric, avoids this additional overhead.
### Main contributions of NEAT:
- **Proposing a new data-centric method**: NEAT distinguishes known from unknown categories using a clustering criterion and a consistency measure, avoiding mislabeling samples of unknown categories.
- **High computational efficiency**: NEAT does not need to train an additional detection network, so it is more computationally efficient.
- **Experimental verification**: The experimental results on multiple data sets show that NEAT is superior to existing active learning methods in selecting informative samples of known categories, and has significant improvements in accuracy, precision, and recall.
### Formulas and algorithms:
- **Clustering criteria**: NEAT uses feature similarity to identify samples of known categories. Specifically, if the K-nearest neighbors of an unlabeled sample all belong to known categories, then the sample is likely to also belong to a known category.
- **Consistency measure**: NEAT selects informative samples by calculating the consistency between the model prediction and the local feature distribution. The specific formula is as follows:
\[
I(x)=-\sum_{c = 1}^{C}P_x[c]\log\tilde{V}_x[c]
\]
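where \(P_x\) is the prediction probability vector of the model for sample \(x\), and \(\tilde{V}_x\) is the K-nearest neighbor label vector after softmax normalization. The score has the form of a cross-entropy between the two vectors, so it grows as the model's prediction disagrees more with the labels of the sample's neighbors.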
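The two criteria above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the `UNKNOWN` marker, the helper names, the toy 2-D features, and the reading of \(V_x\) as per-class neighbor vote counts are all assumptions made for illustration.

```python
import math

UNKNOWN = -1  # hypothetical marker: a labeled sample annotated as "not a known class"


def k_nearest_labels(x, feats, labels, k):
    """Labels of the k labeled samples closest to x (squared Euclidean distance)."""
    order = sorted(
        range(len(feats)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(x, feats[i])),
    )
    return [labels[i] for i in order[:k]]


def looks_known(neighbor_labels):
    """Clustering criterion as summarized above: every neighbor is known-class."""
    return all(lbl != UNKNOWN for lbl in neighbor_labels)


def softmax(values):
    """Numerically stable softmax; every output is strictly positive."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def inconsistency(pred_probs, neighbor_labels, num_classes):
    """I(x) = -sum_c P_x[c] * log(V~_x[c]).

    Assumption: V_x holds per-class vote counts among the neighbors,
    and V~_x is the softmax of those counts.
    """
    votes = [0.0] * num_classes
    for lbl in neighbor_labels:
        if lbl != UNKNOWN:
            votes[lbl] += 1.0
    v_tilde = softmax(votes)
    return -sum(p * math.log(v) for p, v in zip(pred_probs, v_tilde))


# Toy labeled set: two known classes (0 and 1) and one cluster of unknown samples.
feats = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (10, 0), (10, 1), (11, 0)]
labels = [0, 0, 0, 1, 1, 1, UNKNOWN, UNKNOWN, UNKNOWN]

near_class0 = k_nearest_labels((0.2, 0.3), feats, labels, k=3)
near_unknown = k_nearest_labels((10.2, 0.3), feats, labels, k=3)

# Same neighborhood, two different model predictions.
agree = inconsistency([0.9, 0.1], near_class0, num_classes=2)
disagree = inconsistency([0.1, 0.9], near_class0, num_classes=2)

print(looks_known(near_class0), looks_known(near_unknown))
print(agree < disagree)
```

In this toy setup, the sample next to the class-0 cluster passes the known-class check while the one next to the unknown cluster does not, and a prediction that contradicts the neighborhood receives the higher inconsistency score, which would make it the more informative candidate to query.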
### Experimental results:
- **Accuracy**: NEAT improves accuracy over existing methods by 6%, 11% and 10% on the CIFAR10, CIFAR100 and Tiny-ImageNet data sets, respectively.
- **Precision and recall**: NEAT also performs well in precision and recall, especially on the CIFAR100 data set, where the recall is improved by 12%.
Overall, NEAT addresses the challenges of active learning in the open world and is particularly effective at handling unknown categories.