Supervised Contrast Learning Text Classification Model Based on Data Quality Augmentation

Liang Wu, Fangfang Zhang, Chao Cheng, Shinan Song

DOI: https://doi.org/10.1145/3653300

IF: 1.471

2024-03-19

ACM Transactions on Asian and Low-Resource Language Information Processing

Abstract: Token-level data augmentation generates text samples by modifying the words of sentences. However, data that are not easily classified can negatively affect the model. In particular, ignoring the role of keywords when performing random augmentation operations on samples may produce low-quality supplementary samples. Therefore, we propose a supervised contrast learning text classification model based on data quality augmentation (DQA). First, dynamic training is used to screen for high-quality data containing information that benefits model training. The selected data are then augmented based on important words that carry label information. To obtain a better text representation for the downstream classification task, we employ a standard supervised contrast loss to train the model. Finally, we conduct experiments on five text classification datasets to validate the effectiveness of our model. In addition, ablation experiments are conducted to verify the impact of each module on classification.
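The abstract names a standard supervised contrastive loss but does not state it. For orientation, one widely used form from the supervised contrastive learning literature is given below; whether the paper uses exactly this variant is an assumption. Here $z_i$ is the normalized representation of sample $i$ in a batch $I$, $\tau$ is a temperature, $A(i)$ is the set of all other samples in the batch, and $P(i) \subseteq A(i)$ is the set of samples sharing the label of $i$.

```latex
\mathcal{L}_{\text{sup}}
  = \sum_{i \in I} \frac{-1}{|P(i)|}
    \sum_{p \in P(i)}
    \log \frac{\exp\left(z_i \cdot z_p / \tau\right)}
              {\sum_{a \in A(i)} \exp\left(z_i \cdot z_a / \tau\right)}
```

Minimizing this loss pulls same-label representations together and pushes different-label representations apart, which is the kind of text representation the abstract says the downstream classifier relies on.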

computer science, artificial intelligence

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper proposes a supervised contrastive learning text classification model based on data quality augmentation (DQA). Its main aim is to fix how existing text data augmentation methods handle keywords, and thereby improve the accuracy of text classification. Specifically, the paper addresses the following issues:

1. **Keywords Not Considered**: Existing augmentation methods such as EDA do not consider the role of keywords during random operations, so they may delete keywords that carry the sentence's semantics.
2. **Low-Quality Samples Affect Model Training**: Difficult-to-classify samples can negatively impact the model.
3. **Inefficiency of Data Augmentation Methods**: Current data augmentation techniques extract keywords inefficiently and must recalculate them each time the dataset is updated.

To overcome these issues, the paper proposes the following methods:

1. **Filtering High-Quality Data**: Using dynamic training to select high-quality data that contain beneficial information.
2. **Keyword-Based Data Augmentation**: Augmenting the filtered data based on important words derived from label information.
3. **Supervised Contrastive Loss Training**: Using a standard supervised contrastive loss to train the model for better text representation.

These steps are intended to reduce overfitting and improve the accuracy of text classification.
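To make the keyword issue concrete, here is a minimal sketch of EDA-style random deletion that never drops a protected keyword. It is an illustration only, not the paper's implementation: the function name `keyword_aware_deletion`, the deletion probability `p`, and the hand-picked `keywords` set are all assumptions, and the page does not describe how the paper actually extracts important words.

```python
import random


def keyword_aware_deletion(tokens, keywords, p=0.2, rng=None):
    """Drop each non-keyword token with probability p; always keep keywords.

    Hypothetical helper for illustration; `keywords` is assumed to be given.
    """
    rng = rng or random.Random()
    protected = {k.lower() for k in keywords}
    kept = [t for t in tokens if t.lower() in protected or rng.random() >= p]
    # Never return an empty sentence: fall back to the original tokens.
    return kept if kept else list(tokens)


tokens = "the battery life of this phone is terrible".split()
keywords = {"battery", "terrible"}  # assumed label-related keywords
augmented = keyword_aware_deletion(tokens, keywords, p=0.5, rng=random.Random(0))
```

Plain random deletion would treat "battery" and "terrible" like any other token. Here only the remaining words are candidates for removal, so the augmented sample keeps the words assumed to determine its label.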

Boosting Unsupervised Contrastive Learning Using Diffusion-Based Data Augmentation from Scratch

Reducing and Exploiting Data Augmentation Noise through Meta Reweighting Contrastive Learning for Text Classification

Contrastive learning with text augmentation for text classification

Differentiable Data Augmentation for Contrastive Sentence Representation Learning

Data Quality Enhancement on the Basis of Diversity with Large Language Models for Text Classification: Uncovered, Difficult, and Noisy

Unsupervised Document Embedding via Contrastive Augmentation

What Have Been Learned & What Should Be Learned? An Empirical Study of How to Selectively Augment Text for Classification

Time Series Classification Based on Data-Augmented Contrastive Learning

Unlocking the Potential of Data Augmentation in Contrastive Learning for Hyperspectral Image Classification

Exploring ChatGPT-based Augmentation Strategies for Contrastive Aspect-based Sentiment Analysis

DAGAM: Data Augmentation with Generation And Modification

Feature Augmentation for Self-supervised Contrastive Learning: A Closer Look

Heterogeneous graph contrastive learning with adaptive data augmentation for semi‐supervised short text classification

Improving Text Classification with Large Language Model-Based Data Augmentation

Adaptive Data Augmentation for Contrastive Learning

TextANN: An Improved Text Classification Model Based on Data Augmentation

Exploring Data Augmentations on Self-/Semi-/Fully- Supervised Pre-trained Models

Solving Data Imbalance in Text Classification with Constructing Contrastive Samples

Constructing Contrastive Samples Via Summarization for Text Classification with Limited Annotations