Abstract: Annotation of training data is the major bottleneck in the creation of text classification systems. Active learning is a commonly used technique to reduce the amount of training data one needs to label. A crucial aspect of active learning is determining when to stop labeling data. Three potential sources for informing when to stop active learning are an additional labeled set of data, an unlabeled set of data, and the training data that is labeled during the process of active learning. To date, no one has compared and contrasted the advantages and disadvantages of stopping methods based on these three information sources. We find that stopping methods that use unlabeled data are more effective than methods that use labeled data.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the stopping-criterion problem in active learning for text classification. Specifically, the authors examine stopping methods that rely on unlabeled data and stopping methods that rely on labeled data, and compare their advantages and disadvantages.

#### Background

In text classification systems, labeling training data is a major bottleneck because manual labeling is time-consuming and costly. Active learning reduces the amount of labeled data required, and improves model performance, by selecting the most valuable examples for labeling. Deciding when to stop active learning is therefore crucial: stopping too early can leave the model with poor performance, while stopping too late incurs unnecessary labeling cost.

#### Main problems

Existing research has not comprehensively compared stopping methods based on the three available information sources: unlabeled data, a small labeled validation set, and the data labeled during active learning itself. The paper therefore asks:

- Which is more effective as the basis for stopping active learning: unlabeled data or labeled data?
- Is it worth the extra cost of labeling data to determine the stopping point?

#### Research purposes

The main purpose of the paper is to compare stopping methods based on unlabeled data with those based on labeled data, evaluate how well each works, and determine whether labeling extra data solely for the stopping criterion is worth the cost. The results show that stopping methods using unlabeled data not only avoid additional labeling costs but also perform better than methods using labeled data.

#### Conclusions

Through experiments, the authors find that:

- Stopping methods that use unlabeled data (such as the SP method) are more effective than stopping methods that use labeled data.
- Labeling extra data to determine the stopping point is not worth the cost, because the unlabeled-data methods already perform better.

These conclusions offer guidance for applying active learning to text classification, helping researchers and practitioners design and tune their active learning strategies.
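
The summary does not describe how a stopping method based on unlabeled data works. As a minimal sketch, assuming SP refers to stopping on stabilizing predictions, one way to implement the idea is to have each successive model label a fixed unlabeled "stop set" and to stop once consecutive models agree almost perfectly. The function names `cohen_kappa` and `should_stop`, the agreement threshold of 0.99, and the window of 3 comparisons are illustrative assumptions, not values taken from this page.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa agreement between two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Fraction of stop-set items on which the two models agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each model's label distribution.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both models assign the same single class to every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def should_stop(prediction_history, threshold=0.99, window=3):
    """Return True when the last `window` consecutive model pairs all agree.

    prediction_history: list of label lists, one per active learning round,
    each holding that round's predictions on the same unlabeled stop set.
    threshold and window are assumed defaults chosen for illustration.
    """
    if len(prediction_history) < window + 1:
        return False  # not enough rounds to form `window` comparisons
    recent = prediction_history[-(window + 1):]
    return all(
        cohen_kappa(prev, curr) >= threshold
        for prev, curr in zip(recent, recent[1:])
    )


if __name__ == "__main__":
    # Hypothetical predictions on a four-item stop set over five rounds.
    history = [
        [0, 1, 0, 1],
        [0, 1, 1, 1],
        [0, 1, 1, 1],
        [0, 1, 1, 1],
        [0, 1, 1, 1],
    ]
    for rounds in range(1, len(history) + 1):
        print(rounds, should_stop(history[:rounds]))
```

Because the stop set is never labeled, a check of this kind adds no annotation cost, which is the property the conclusions above emphasize. In practice the stop set would be far larger than the four items used here.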

The Use of Unlabeled Data versus Labeled Data for Stopping Active Learning for Text Classification
