The Use of Unlabeled Data versus Labeled Data for Stopping Active Learning for Text Classification

Garrett Beatty, Ethan Kochis, Michael Bloodgood
DOI: https://doi.org/10.1109/ICOSC.2019.8665546
2019-04-23
Abstract: Annotation of training data is the major bottleneck in the creation of text classification systems. Active learning is a commonly used technique to reduce the amount of training data one needs to label. A crucial aspect of active learning is determining when to stop labeling data. Three potential sources for informing when to stop active learning are an additional labeled set of data, an unlabeled set of data, and the training data that is labeled during the process of active learning. To date, no one has compared and contrasted the advantages and disadvantages of stopping methods based on these three information sources. We find that stopping methods that use unlabeled data are more effective than methods that use labeled data.
Machine Learning, Computation and Language, Information Retrieval
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the stopping criterion problem for active learning in text classification. Specifically, the authors examine unlabeled data and labeled data as the basis for stopping the active learning process, and compare the advantages and disadvantages of the two approaches.

#### Background

In text classification systems, labeling training data is a major bottleneck because manual labeling is time-consuming and costly. Active learning reduces the amount of labeled data required, and aims to improve model performance, by selecting the most valuable examples for labeling. Deciding when to stop active learning is therefore crucial: stopping too early can leave the model performing poorly, while stopping too late incurs unnecessary labeling cost.

#### Main problems

Prior research has not comprehensively compared stopping methods based on three information sources: unlabeled data, a small-scale labeled validation set, and the data labeled during the training process. The paper therefore attempts to answer the following questions:

- Which is more effective as the basis for stopping active learning: unlabeled data or labeled data?
- Is it worth the extra cost of labeling data to determine the stopping point?

#### Research purposes

The main purpose of the paper is to compare stopping methods based on unlabeled data with those based on labeled data, evaluate their effectiveness, and determine whether labeling extra data for the stopping criterion is worth the cost. The results show that stopping methods using unlabeled data not only avoid additional labeling cost but also perform better than methods using labeled data.

#### Conclusions

Through experiments, the authors find that:

- Stopping methods that use unlabeled data (such as the SP method) are more effective than stopping methods that use labeled data.
- Labeling extra data to determine the stopping point is not worthwhile, because the unlabeled-data methods already perform better.

These conclusions offer guidance for applying active learning to text classification tasks, helping researchers and practitioners design and tune their active learning strategies.
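
To make the idea of stopping on unlabeled data concrete, the sketch below shows one plausible form of such a criterion: after each active learning round, the current model predicts labels for a fixed set of unlabeled examples, and labeling stops once the predictions of consecutive models agree closely for several rounds in a row. This is an illustration only, not the authors' implementation; the summary above does not describe how the SP method works internally. The function names `cohen_kappa` and `should_stop`, the use of Cohen's kappa as the agreement measure, and the default `threshold` and `window` values are all assumptions made for this example.

```python
from collections import Counter


def cohen_kappa(preds_a, preds_b):
    """Cohen's kappa between two equal-length sequences of predicted labels."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(preds_a)
    # Fraction of examples on which the two models agree.
    observed = sum(a == b for a, b in zip(preds_a, preds_b)) / n
    # Agreement expected by chance, from each model's label frequencies.
    counts_a, counts_b = Counter(preds_a), Counter(preds_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both models assign the same single label to every example.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def should_stop(prediction_history, threshold=0.99, window=3):
    """Decide whether to stop labeling.

    prediction_history holds one list of predictions per active learning
    round, all made on the same fixed set of unlabeled examples.
    threshold and window are illustrative defaults (assumptions).
    """
    if len(prediction_history) < window + 1:
        return False
    recent = prediction_history[-(window + 1):]
    # Stop only if every consecutive pair of recent models agrees closely.
    return all(
        cohen_kappa(prev, curr) >= threshold
        for prev, curr in zip(recent, recent[1:])
    )


# Hypothetical predictions from five successive models on four unlabeled texts.
history = [
    [0, 1, 0, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
]
print(should_stop(history[:4]))  # predictions still changing within the window
print(should_stop(history))      # predictions stable across the window
```

No human labels are needed for the set used in the check, which is the cost advantage the conclusions above point to. A criterion based on a labeled validation set would instead require annotating that set before it could be used.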