BoostAug: Boosting Text Augmentation via Hybrid Instance Filtering Framework

Heng Yang, Ke Li

2024-04-01

Abstract: Text augmentation is an effective technique for addressing the problem of insufficient data in natural language processing. However, existing text augmentation methods tend to focus on few-shot scenarios and usually perform poorly on large public datasets. Our research indicates that existing augmentation methods often generate instances with shifted feature spaces, which leads to a drop in performance on the augmented data (for example, EDA generally loses $\approx 2\%$ in aspect-based sentiment classification). To address this problem, we propose BoostAug, a hybrid instance-filtering framework based on pre-trained language models that maintains a feature space similar to that of natural datasets. BoostAug is transferable to existing text augmentation methods (such as synonym substitution and back translation) and significantly improves augmentation performance, by $\approx 2-3\%$ in classification accuracy. Our experimental results on three classification tasks and nine public datasets show that BoostAug addresses the performance drop problem and outperforms state-of-the-art text augmentation methods. Additionally, we release the code to help improve existing augmentation methods on large datasets.

Computation and Language

What problem does this paper attempt to address?

This paper addresses the performance degradation of existing text augmentation methods on large public datasets. Current text augmentation techniques perform well in few-shot scenarios, but on large datasets they often generate instances whose feature space is shifted relative to the natural data, which lowers model performance. For example, EDA (a common text augmentation method) typically loses about 2% in aspect-based sentiment classification. To solve this problem, the authors propose BoostAug, a hybrid instance-filtering framework based on pre-trained language models that aims to keep the feature space of the augmented instances similar to that of natural datasets. The framework can be applied on top of existing augmentation methods such as synonym substitution and back translation. Experiments on three classification tasks and nine public datasets show that BoostAug improves the classification accuracy of existing text augmentation methods by approximately 2-3%. The authors have also released the code to help improve existing augmentation methods on large datasets.
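The general idea of instance filtering can be illustrated with a minimal sketch. The abstract does not specify which filtering criteria the framework uses, so the two criteria below (label agreement and a confidence threshold) are assumptions chosen for illustration. The names `filter_augmentations`, `toy_scorer`, and `min_confidence` are hypothetical, and the keyword-based scorer merely stands in for a classifier built on a pre-trained language model.

```python
from typing import Callable, List, Tuple

# A scorer maps text -> (predicted_label, confidence in [0, 1]).
# In practice this would wrap a classifier built on a pre-trained
# language model; here it is a hypothetical stand-in.
Scorer = Callable[[str], Tuple[str, float]]


def filter_augmentations(
    candidates: List[Tuple[str, str]],
    scorer: Scorer,
    min_confidence: float = 0.7,
) -> List[Tuple[str, str]]:
    """Keep (text, label) pairs whose label the scorer confirms confidently."""
    kept = []
    for text, label in candidates:
        predicted, confidence = scorer(text)
        # Assumed criteria (not taken from the paper): the predicted label
        # must match the assigned label, and confidence must reach a floor.
        if predicted == label and confidence >= min_confidence:
            kept.append((text, label))
    return kept


def toy_scorer(text: str) -> Tuple[str, float]:
    """Toy keyword scorer standing in for a language-model classifier."""
    positive = {"great", "good", "tasty"}
    negative = {"bad", "awful", "bland"}
    words = text.lower().split()
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    if pos == neg:
        return "neutral", 0.5
    label = "positive" if pos > neg else "negative"
    return label, max(pos, neg) / (pos + neg)


# Augmented candidates inherit the label of the instance they came from.
augmented = [
    ("the food was great", "positive"),     # meaning preserved
    ("the food was awful", "positive"),     # augmentation flipped the meaning
    ("the food was good bad", "positive"),  # ambiguous after augmentation
]
kept = filter_augmentations(augmented, toy_scorer)
```

In this toy run only the first candidate survives: the second is rejected because the scorer disagrees with its inherited label, and the third because the scorer is not confident. Raising `min_confidence` trades the number of retained instances for tighter agreement with the scorer.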

Boosting Unsupervised Contrastive Learning Using Diffusion-Based Data Augmentation from Scratch

Back-Modality: Leveraging Modal Transformation for Data Augmentation.

Data Boost: Text Data Augmentation Through Reinforcement Learning Guided Conditional Generation

Reducing and Exploiting Data Augmentation Noise through Meta Reweighting Contrastive Learning for Text Classification

Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification

DualAug: Exploiting Additional Heavy Augmentation with OOD Data Rejection

AugGPT: Leveraging ChatGPT for Text Data Augmentation

DAGAM: Data Augmentation with Generation And Modification

EntAugment: Entropy-Driven Adaptive Data Augmentation Framework for Image Classification

AdaAugment: A Tuning-Free and Adaptive Approach to Enhance Data Augmentation

DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification

KeepAugment: A Simple Information-Preserving Data Augmentation Approach

Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition

Text Augmentation in a Multi-Task View

MixGen: A New Multi-Modal Data Augmentation

Tied-Augment: Controlling Representation Similarity Improves Data Augmentation

ProAug: Prototype-Based Augmentation for Long-Tailed Image Classification.

Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers

AutoAugment Is What You Need: Enhancing Rule-based Augmentation Methods in Low-resource Regimes

Boosting Model Resilience via Implicit Adversarial Data Augmentation