Supervised Contrast Learning Text Classification Model Based on Data Quality Augmentation

Liang Wu, Fangfang Zhang, Chao Cheng, Shinan Song
DOI: https://doi.org/10.1145/3653300
IF: 1.471
2024-03-19
ACM Transactions on Asian and Low-Resource Language Information Processing
Abstract: Token-level data augmentation generates text samples by modifying the words of the sentences. However, data that are not easily classified can negatively affect the model. In particular, not considering the role of keywords when performing random augmentation operations on samples may lead to the generation of low-quality supplementary samples. Therefore, we propose a supervised contrast learning text classification model based on data quality augmentation (DQA). First, dynamic training is used to screen high-quality datasets containing beneficial information for model training. The selected data are then augmented based on important words with tag information. To obtain a better text representation to serve the downstream classification task, we employ a standard supervised contrast loss to train the model. Finally, we conduct experiments on five text classification datasets to validate the effectiveness of our model. In addition, ablation experiments are conducted to verify the impact of each module on classification.
computer science, artificial intelligence
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper proposes a supervised contrastive learning text classification model based on data quality augmentation (DQA). Its main aim is to fix how existing text data augmentation methods handle keywords, and thereby improve the accuracy of text classification tasks. Specifically, the paper addresses the following issues:

1. **Keywords Not Considered**: Existing augmentation methods like EDA do not consider the role of keywords during random operations, which may result in the deletion of keywords that reflect the sentence's semantics.
2. **Low-Quality Samples Affect Model Training**: Difficult-to-classify data samples can negatively impact the model.
3. **Inefficiency of Data Augmentation Methods**: Current data augmentation techniques are inefficient at extracting keywords and require recalculation each time the dataset is updated.

To overcome these issues, the paper proposes the following methods:

1. **Filtering High-Quality Data**: Using dynamic training to select high-quality data that contain information beneficial to model training.
2. **Keyword-Based Data Augmentation**: Augmenting the selected data based on important words derived from label information.
3. **Supervised Contrastive Loss Training**: Using a standard supervised contrastive loss to train the model for better text representation.

With these improvements, the model is intended to avoid overfitting and improve the accuracy of text classification tasks.
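
The supervised contrastive loss named in the third method can be illustrated with a small sketch. This is not the authors' implementation: it assumes the commonly used formulation in which each anchor is pulled toward all other samples sharing its label and pushed away from the rest, with the loss averaged over anchors that have at least one positive. The function names, the `temperature` default of 0.1, and the toy embeddings are all hypothetical choices made for illustration, and the embeddings are assumed to be non-zero vectors.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Sketch of a supervised contrastive loss over one batch.

    For each anchor i, positives are the other samples with the same label.
    The per-anchor loss is the mean over positives p of
    -log( exp(z_i . z_p / t) / sum_{a != i} exp(z_i . z_a / t) ).
    Anchors without any positive are skipped (an assumption of this sketch).
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        # Temperature-scaled similarity between the anchor and every other sample.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        positives = [a for a in sims if labels[a] == labels[i]]
        if not positives:
            continue
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: two tight clusters in 2-D.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = supervised_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
loss_mixed = supervised_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

In this toy batch, labels that agree with the geometric clusters give a much smaller loss than labels that cut across them, which is the behavior the training objective relies on to shape the text representation.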