A Cross-Domain Benchmark for Active Learning

Thorben Werner, Johannes Burchert, Maximilian Stubbemann, Lars Schmidt-Thieme
2024-08-01
Abstract: Active Learning (AL) deals with identifying the most informative samples for labeling to reduce data annotation costs for supervised learning tasks. AL research suffers from two problems: performance lifts reported in the literature generalize poorly, and experiments are repeated only a small number of times. To overcome these obstacles, we propose CDALBench, the first active learning benchmark that includes tasks in computer vision, natural language processing, and tabular learning. Furthermore, by providing an efficient, greedy oracle, CDALBench can be evaluated with 50 runs for each experiment. We show that both the cross-domain character and a large number of repetitions are crucial for a rigorous evaluation of AL research. Concretely, we show that the superiority of specific methods varies across domains, making it important to evaluate Active Learning with a cross-domain benchmark. Additionally, we show that a large number of runs is crucial. With only three runs, as is often done in the literature, the apparent superiority of specific methods can vary strongly with the specific runs. This effect is so strong that, depending on the seed, even a well-established method's performance can be either significantly better or significantly worse than random for the same dataset.
Machine Learning
What problem does this paper attempt to address?
The paper addresses two weaknesses in Active Learning (AL) research: existing methods generalize poorly across domains, and results are unreliable because experiments are repeated too few times. Specifically:

1. **Domain Generalization Issue**: Existing active learning methods are often evaluated in only one domain (such as computer vision or natural language processing), so their performance cannot be assumed to carry over to other domains. A cross-domain benchmark is therefore needed to evaluate these methods across application areas.
2. **Insufficient Experimental Repetitions**: Because of limited computational resources, many studies conduct only a small number of experimental repetitions (e.g., 3), which leads to high variance in results and makes it difficult to draw meaningful conclusions. For example, a method may perform worse than random selection under some random seeds while significantly outperforming it under others.

To address these issues, the authors propose **CDALBench**, a benchmark for active learning that spans computer vision, natural language processing, and tabular data. By running each experiment 50 times, CDALBench can evaluate active learning methods more reliably and reveal how their performance differs across domains. The authors also provide an efficient greedy oracle, which makes 50 runs per experiment feasible.
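The point about too few repetitions can be illustrated with a small simulation. The sketch below is not taken from the paper: the scores are synthetic Gaussian draws, and the means, standard deviation, and function names (`simulate_runs`, `fraction_of_flipped_subsets`) are assumptions chosen purely for illustration. It estimates how often a randomly chosen subset of 3 runs would rank a method and a random baseline in the opposite order from all 50 runs.

```python
import random
import statistics


def simulate_runs(n_runs, mean, std, rng):
    """Draw one synthetic final score per run (seed). Illustrative data only."""
    return [rng.gauss(mean, std) for _ in range(n_runs)]


def fraction_of_flipped_subsets(method, baseline, k, n_trials, rng):
    """Fraction of random k-run subsets whose mean difference has the
    opposite sign from the mean difference over all runs."""
    full_diff = statistics.mean(method) - statistics.mean(baseline)
    flipped = 0
    for _ in range(n_trials):
        # Pick k seeds and compare both methods on those same seeds.
        idx = rng.sample(range(len(method)), k)
        sub_method = [method[i] for i in idx]
        sub_baseline = [baseline[i] for i in idx]
        sub_diff = statistics.mean(sub_method) - statistics.mean(sub_baseline)
        if (sub_diff > 0) != (full_diff > 0):
            flipped += 1
    return flipped / n_trials


rng = random.Random(0)
# Assumed values: a small true advantage relative to the run-to-run noise.
method_scores = simulate_runs(50, mean=0.81, std=0.02, rng=rng)
baseline_scores = simulate_runs(50, mean=0.80, std=0.02, rng=rng)

flip_rate = fraction_of_flipped_subsets(
    method_scores, baseline_scores, k=3, n_trials=2000, rng=rng
)
print(f"3-run subsets that disagree with the 50-run ranking: {flip_rate:.1%}")
```

Under these assumed numbers, a noticeable share of 3-run subsets reverses the ranking, whereas using all 50 runs cannot by construction. How large the effect is on real benchmarks depends on the actual variance across seeds, which this sketch does not model.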