Deep Active Learning over the Long Tail

Yonatan Geifman, Ran El-Yaniv
DOI: https://doi.org/10.48550/arXiv.1711.00941
2017-11-03
Abstract: This paper is concerned with pool-based active learning for deep neural networks. Motivated by coreset dataset compression ideas, we present a novel active learning algorithm that queries consecutive points from the pool using farthest-first traversals in the space of neural activation over a representation layer. We show consistent and overwhelming improvement in sample complexity over passive learning (random sampling) for three datasets: MNIST, CIFAR-10, and CIFAR-100. In addition, our algorithm outperforms the traditional uncertainty sampling technique (obtained using softmax activations), and we identify cases where uncertainty sampling is only slightly better than random sampling.
Machine Learning
What problem does this paper attempt to address?
This paper addresses pool-based active learning for deep neural networks: how to choose which unlabeled samples from a pool to send for annotation so that model performance improves as quickly as possible. Specifically, the authors start from a deep neural network that has already been trained to a reasonable accuracy and ask how active learning can improve it further, reducing the amount of annotated data required to reach a given level of generalization.

### Problem Background

1. **Basic Concepts of Active Learning**
   Active learning lets the learning algorithm choose which data points are annotated during training. Passive learning, by contrast, labels points drawn by random sampling and applies no selection strategy.

2. **Existing Challenges**
   - **Hyper-parameter Selection**: Without prior knowledge, choosing appropriate hyper-parameters is difficult, especially in the early stage, when little annotated data is available and what is available is biased.
   - **Sample Complexity**: Traditional methods such as uncertainty sampling are, in some cases, only slightly better than random sampling and do not significantly reduce the number of annotated samples required.

### Main Contributions of the Paper

1. **Proposing a New Active Learning Algorithm**
   The algorithm is based on farthest-first traversal: in the space of representation-layer activations, it selects for annotation the pool points that lie farthest from the currently labeled samples. The aim is to explore regions of the data space that the labeled set does not yet cover.

2. **Addressing the Long-Tail Setting**
   The authors propose a "long-tail" variant of the pool-based active learning setting: once the model has been trained to a reasonable accuracy, annotated data is added gradually to improve performance further. This setting avoids inefficient annotation in the early stage and concentrates on improvement in the later stage.

3. **Experimental Verification**
   Experiments on three datasets (MNIST, CIFAR-10, and CIFAR-100) show that the new algorithm has significantly better sample complexity than both random sampling and traditional uncertainty sampling.

### Summary of Mathematical Formulas

- **Risk Minimization Objective**
  \[ R_\ell(f)=\mathbb{E}_{(x, y)\sim P(X, Y)}[\ell(f(x), y)] \]
  where \( f \) is the prediction function and \( \ell \) is the loss function.

- **Farthest-First Traversal Selection Formula**
  \[ (x', y')=\arg\max_{(x', y')\in U}\min_{(x, y)\in L_{t - 1}\cup S_b}d(\phi(x'), \phi(x)) \]
  where \( \phi(x) \) is the activation of input \( x \) at the representation layer and \( d(u, v)=\|u - v\|_2 \) is the Euclidean distance. Here \( U \) denotes the unlabeled pool, \( L_{t - 1} \) the labeled set from the previous round, and \( S_b \) the points already selected for the current batch; the label \( y' \) of a pool point is not observed until the point is queried.

### Conclusion

This paper proposes a new pool-based active learning algorithm that applies farthest-first traversal in the representation-layer space. According to the reported experiments, it significantly improves the sample complexity of deep neural networks, especially on large-scale datasets, and it offers a starting point for further research on deep active learning.
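
The farthest-first traversal selection formula summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `farthest_first_batch` and its parameters are hypothetical, and the representation-layer activations \( \phi(x) \) are assumed to have been precomputed as plain arrays rather than extracted from a trained network.

```python
import numpy as np


def farthest_first_batch(pool_feats, labeled_feats, batch_size):
    """Greedily pick `batch_size` pool indices by farthest-first traversal.

    pool_feats:    (n_pool, d) array, assumed to hold phi(x') for the pool U.
    labeled_feats: (n_labeled, d) array, assumed to hold phi(x) for L_{t-1}.
    Returns pool indices in the order they were selected.
    """
    # Distance from every pool point to its nearest labeled point.
    diffs = pool_feats[:, None, :] - labeled_feats[None, :, :]
    min_dist = np.linalg.norm(diffs, axis=2).min(axis=1)

    selected = []
    for _ in range(batch_size):
        # argmax over the pool of the min distance to L_{t-1} union S_b.
        idx = int(np.argmax(min_dist))
        selected.append(idx)
        # The chosen point joins S_b, so refresh the nearest distances.
        new_dist = np.linalg.norm(pool_feats - pool_feats[idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected


# Toy example: four pool points on a line, one labeled point at the origin.
pool = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [5.0, 0.0]])
labeled = np.array([[0.0, 0.0]])
print(farthest_first_batch(pool, labeled, 3))  # prints [2, 3, 1]
```

In the toy example the point at distance 10 is chosen first, then the point at 5 (now farthest from both the origin and the first pick), then the point at 1. The pairwise distance array makes the initial step quadratic in memory, so a real pool would likely need chunking or a nearest-neighbor index; that trade-off is outside what the summary describes.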