Active Learning Over Multiple Domains in Natural Language Tasks

Shayne Longpre, Julia Reisler, Edward Greg Huang, Yi Lu, Andrew Frank, Nikhil Ramesh, Chris DuBois
DOI: https://doi.org/10.48550/arXiv.2202.00254
2022-02-08
Abstract: Studies of active learning traditionally assume the target and source data stem from a single domain. However, in realistic applications, practitioners often require active learning with multiple sources of out-of-distribution data, where it is unclear a priori which data sources will help or hurt the target domain. We survey a wide variety of techniques in active learning (AL), domain shift detection (DS), and multi-domain sampling to examine this challenging setting for question answering and sentiment analysis. We ask (1) what family of methods are effective for this task? And, (2) what properties of selected examples and domains achieve strong results? Among 18 acquisition functions from 4 families of methods, we find H-Divergence methods, and particularly our proposed variant DAL-E, yield effective results, averaging 2-3% improvements over the random baseline. We also show the importance of a diverse allocation of domains, as well as room-for-improvement of existing methods on both domain and example selection. Our findings yield the first comprehensive analysis of both existing and novel methods for practitioners faced with multi-domain active learning for natural language tasks.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses active learning over multi-source data (i.e., data drawn from different distributions) in natural language tasks. Active learning research traditionally assumes that the target data and the source data come from the same domain. In practice, however, practitioners often have to work with non-identically distributed data from multiple sources, each of which may help or hurt performance on the target domain. The paper therefore explores two main questions:

1. **Which families of methods are effective for multi-domain active learning?**
2. **Which properties of the selected examples and domains lead to strong results?**

### Research background

In natural language processing (NLP), new tasks are often constrained by a scarcity of labeled data. Unlabeled data is easy to obtain, but it may come from a distribution that differs from the target. This is especially common when distributions shift significantly over time, when user subgroups have personalized needs, or when data is collected through different media. Active learning (AL) is typically used to guide the construction of a larger training set, that is, to decide which unlabeled examples should be labeled under a fixed labeling budget. However, most of the AL literature in NLP assumes that the unlabeled source data and the target data come from the same distribution. This simplifying assumption overlooks the challenges that multi-domain active learning routinely poses in practice.
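The budgeted selection step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `PoolExample`, `select_for_labeling`, `domain_allocation`, the toy pool, and the text-length scoring function are all hypothetical stand-ins introduced here. A real acquisition function would score examples with a trained model.

```python
from collections import Counter
from typing import Callable, List, NamedTuple


class PoolExample(NamedTuple):
    """One unlabeled example (hypothetical structure for illustration)."""
    text: str
    domain: str  # name of the source dataset the example came from


def select_for_labeling(
    pool: List[PoolExample],
    score_fn: Callable[[PoolExample], float],
    budget: int,
) -> List[PoolExample]:
    """Return the `budget` highest-scoring examples from the unlabeled pool."""
    ranked = sorted(pool, key=score_fn, reverse=True)
    return ranked[:budget]


def domain_allocation(selected: List[PoolExample]) -> Counter:
    """Count how many selected examples come from each source domain."""
    return Counter(ex.domain for ex in selected)


# Toy multi-domain pool; the domains and texts are made up.
pool = [
    PoolExample("short review", "reviews"),
    PoolExample("a much longer forum question about refunds", "forum"),
    PoolExample("medium length news item", "news"),
    PoolExample("another fairly long forum question here", "forum"),
]

# Stand-in acquisition score: text length. This is only a placeholder for
# a model-based score such as uncertainty or similarity to the target.
selected = select_for_labeling(pool, score_fn=lambda ex: len(ex.text), budget=2)
allocation = domain_allocation(selected)
```

With this toy scoring function, both selected examples come from the same domain. Inspecting the allocation in this way is one simple check on whether an acquisition function spreads the labeling budget across sources.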
### Research methods

The paper experimentally compares 18 acquisition functions from four families of methods (uncertainty methods, H-Divergence methods, reverse classification accuracy methods, and semantic similarity detection methods) on multiple question answering and sentiment analysis datasets to provide practical guidance for multi-domain active learning. Specifically:

- **Uncertainty methods**: such as Confidence, Entropy, Energy-based Out-of-Distribution Detection, and Bayesian Active Learning by Disagreement (BALD).
- **H-Divergence methods**: such as Discriminative Active Learning (DAL) and its variants (such as DAL-E).
- **Reverse classification accuracy methods**: such as Reverse Classification Accuracy (RCA) and its smoothed variant, RCA-Smoothed (~RCA).
- **Semantic similarity detection methods**: such as Nearest Neighbor (KNN).

### Main findings

1. **H-Divergence methods**, especially the DAL-E variant proposed by the authors, are the most effective in multi-domain active learning, averaging 2-3% improvements over the random baseline.
2. **Diverse domain allocation**: Selecting examples from a diverse mix of domains is important for achieving strong results.
3. **Room for improvement of existing methods**: Existing methods still have room for improvement in both domain selection and example selection.
4. **Orthogonality of different method families**: Different families rely on different notions of relevance when ranking examples, which suggests potential in combining them.

### Conclusion

This paper provides the first comprehensive analysis of existing and novel methods for multi-domain active learning and offers concrete guidance for practitioners facing this setting. In particular, H-Divergence methods (such as DAL-E) perform well in multi-domain active learning, and a diverse allocation of domains is an important factor in achieving strong results. These findings help practitioners select appropriate methods and improve the performance of multi-domain active learning in natural language tasks.
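To make the uncertainty family listed under Research methods more concrete, the sketch below shows two standard textbook scores, least confidence and predictive entropy, and how they rank a pool. It is an illustration under assumptions: the function names and the toy probabilities are hypothetical, and the paper's exact formulations of Confidence and Entropy may differ.

```python
import math
from typing import Callable, List, Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """Uncertainty as one minus the highest predicted class probability."""
    return 1.0 - max(probs)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def rank_by_uncertainty(
    prob_rows: List[Sequence[float]],
    score_fn: Callable[[Sequence[float]], float],
    budget: int,
) -> List[int]:
    """Return indices of the `budget` most uncertain examples."""
    order = sorted(
        range(len(prob_rows)),
        key=lambda i: score_fn(prob_rows[i]),
        reverse=True,
    )
    return order[:budget]


# Made-up model outputs for three unlabeled examples (binary task).
predictions = [
    [0.98, 0.02],  # confident prediction, low uncertainty
    [0.55, 0.45],  # near the decision boundary, high uncertainty
    [0.70, 0.30],  # in between
]

picked = rank_by_uncertainty(predictions, entropy, budget=2)
```

Both scores agree on this toy input and pick the examples at indices 1 and 2. On multi-class outputs they can rank examples differently, because entropy takes the whole distribution into account while least confidence looks only at the top class.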