Abstract: This paper addresses the problem of selecting a set of texts for annotation in text classification using retrieval methods when the number of annotations is limited by constraints on human resources. An additional challenge addressed is dealing with binary categories that have a small number of positive instances, reflecting severe class imbalance. In our situation, where annotation occurs over a long time period, the selection of texts to be annotated can be made in batches, with previous annotations guiding the choice of the next set. To address these challenges, the paper proposes leveraging SHAP to construct a quality set of queries for Elasticsearch and semantic search, in order to identify optimal sets of texts for annotation that will help with class imbalance. The approach is tested on sets of cue texts describing possible future events, constructed by participants involved in studies aimed at helping with the management of obesity and diabetes. We introduce an effective method for selecting a small set of texts for annotation and building high-quality classifiers. We integrate vector search, semantic search, and machine learning classifiers to yield a good solution. Our experiments demonstrate improved F1 scores for the minority classes in binary classification.

What problem does this paper attempt to address?

This paper aims to solve the class-imbalance problem in text classification under limited annotation resources, especially when positive-class samples are scarce in binary classification tasks. Specifically, the paper focuses on the following two main problems:

1. **Limited annotation resources**: Because human resources are limited, only a limited number of texts can be annotated. The key issue is therefore how to select, within that budget, the texts whose annotation most improves classifier performance.
2. **Class imbalance**: In some binary classification tasks, positive-class samples are far fewer than negative-class samples, which biases the classifier towards the majority class and results in poor prediction performance for the minority class.

The paper proposes a retrieval-based method to identify and select more texts related to the minority class for annotation, in order to alleviate the class-imbalance problem.

### Solutions

To solve the above problems, the paper proposes the following methods:

- **Using retrieval models to select texts to be annotated**: By combining Elasticsearch and semantic search, texts that may belong to the minority class are retrieved from the unannotated text pool. These retrieved texts are annotated first to help balance the class distribution in the training dataset.
- **Using SHAP to construct high-quality queries**: A classifier is trained on the initial annotated data, and SHAP (SHapley Additive exPlanations) is used to extract the keywords that most influence its predictions and to construct queries for each class. These queries guide Elasticsearch and semantic search towards potential minority-class samples.
- **Selecting texts for annotation in batches**: Since the annotation process may last for a long time, the paper proposes selecting the texts to be annotated in batches. Each batch is chosen with reference to the previously annotated data, so that the selection strategy is refined from batch to batch and each batch is aimed at the largest improvement in classifier performance.

### Experimental verification

The paper conducted experiments on a dataset of 11,000 texts describing future events. These texts come from multiple medical research projects and relate to the management of conditions such as diabetes and obesity. The experimental results show that the texts selected for annotation by the above methods improve the F1 score of the minority class, thereby improving the overall performance of the classifier.

### Summary

By combining retrieval techniques and machine-learning models, this paper proposes an effective method for dealing with the class-imbalance problem in text classification, especially when annotation resources are limited. With this method, texts can be selected for annotation more efficiently, a more balanced training dataset can be constructed, and the performance of the classifier can be improved.
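As a rough illustration of one round of the keyword-driven batch selection summarized above, the following Python sketch may help. It is not the paper's implementation and rests on several assumptions: a smoothed log-odds score over token counts stands in for SHAP attributions, simple keyword overlap stands in for both Elasticsearch and semantic search, and the names `top_keywords`, `build_es_query`, `select_batch`, and the `text` field are hypothetical. The query body follows the general shape of an Elasticsearch `bool`/`should`/`match` query but is only built here, not sent to a server.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer that keeps alphabetic tokens only."""
    return [t for t in text.lower().split() if t.isalpha()]


def top_keywords(labeled, k=3):
    """Rank tokens by smoothed log-odds of occurring in positive vs. negative texts.

    ASSUMPTION: this score is a simple stand-in for SHAP-based keyword
    importance; it is not what the paper computes.
    """
    pos, neg = Counter(), Counter()
    for text, label in labeled:
        (pos if label == 1 else neg).update(set(tokenize(text)))
    n_pos = sum(1 for _, label in labeled if label == 1)
    n_neg = len(labeled) - n_pos

    def score(tok):
        p = (pos[tok] + 1) / (n_pos + 2)  # add-one smoothing
        q = (neg[tok] + 1) / (n_neg + 2)
        return math.log(p) - math.log(q)

    vocab = set(pos) | set(neg)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(vocab, key=lambda t: (-score(t), t))[:k]


def build_es_query(keywords, field="text", size=10):
    """Build an Elasticsearch-style query body (the field name is hypothetical)."""
    return {
        "size": size,
        "query": {
            "bool": {
                "should": [{"match": {field: kw}} for kw in keywords],
                "minimum_should_match": 1,
            }
        },
    }


def select_batch(pool, keywords, batch_size):
    """Local stand-in for retrieval: rank pool texts by number of keyword hits."""
    kw = set(keywords)
    scored = [(len(kw & set(tokenize(t))), i) for i, t in enumerate(pool)]
    hits = sorted((-s, i) for s, i in scored if s > 0)
    return [pool[i] for _, i in hits[:batch_size]]


# Toy data, invented for illustration: label 1 is the minority (positive) class.
labeled = [
    ("craving sweets after dinner", 1),
    ("craving chips while watching tv", 1),
    ("walking to work tomorrow", 0),
    ("meeting friends after work", 0),
]
pool = [
    "craving chocolate at night",
    "going to work by bus",
    "sweets at the office party",
    "reading a book",
]

keywords = top_keywords(labeled, k=3)
query = build_es_query(keywords)
next_batch = select_batch(pool, keywords, batch_size=2)
```

In a batch setting, the texts in `next_batch` would be sent to annotators, appended to `labeled`, and the keywords recomputed before the next round, so that each batch is guided by the annotations collected so far.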

Retrieval-based Text Selection for Addressing Class-Imbalanced Data in Classification
