Abstract: NLP models are used in a variety of critical social computing tasks, such as detecting sexist, racist, or otherwise hateful content. Therefore, it is imperative that these models are robust to spurious features. Past work has attempted to tackle such spurious features using training data augmentation, including Counterfactually Augmented Data (CADs). CADs introduce minimal changes to existing training data points and flip their labels; training on them may reduce model dependency on spurious features. However, manually generating CADs can be time-consuming and expensive. Hence in this work, we assess if this task can be automated using generative NLP models. We automatically generate CADs using Polyjuice, ChatGPT, and Flan-T5, and evaluate their usefulness in improving model robustness compared to manually-generated CADs. By testing both model performance on multiple out-of-domain test sets and individual data point efficacy, our results show that while manual CADs are still the most effective, CADs generated by ChatGPT come a close second. One key reason for the lower performance of automated methods is that the changes they introduce are often insufficient to flip the original label.

What problem does this paper attempt to address?

This paper evaluates how effective automatically generated counterfactually augmented data (CADs) are for harmful language detection, especially compared with manually generated CADs. Specifically, the research addresses two questions:

1. **RQ1: Capabilities of different automatic CAD generation methods** Study whether automatic CAD generation methods (Polyjuice, ChatGPT, and Flan-T5) can generate effective CADs, and whether these CADs improve model performance. The study found that although manually generated CADs are still the most effective, ChatGPT-generated CADs are a close second. A key problem with automatic methods is that the changes they introduce are often not sufficient to flip the original label, so the generated CADs end up with inaccurate labels.
2. **RQ2: Attributes of effective CADs** Explore the attributes that make CADs effective training data, including minimal change (Levenshtein edit distance), semantic similarity, and edit types (such as adding or deleting negation words, gender or identity words, and affect words). The study found that automatic CADs differ significantly from manual CADs in edit distance and semantic similarity, and that these differences affect the effectiveness of CADs.

### Main Findings

- **Model Performance** Models with manually generated CADs in the training data perform best on out-of-domain (OOD) datasets. ChatGPT-generated CADs are second-best, while Polyjuice- and Flan-T5-generated CADs perform poorly. This indicates that manually generated CADs are more effective in improving the generalization ability of models.
- **CAD Attribute Analysis** Using pointwise V-information (PVI) scores, the study found that training data containing CADs yields a higher average PVI score on OOD datasets, which means that CADs help reduce the difficulty of OOD datasets, making them easier for the model to learn.
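The PVI scores mentioned above can be illustrated with a minimal sketch. It assumes the standard definition of pointwise V-information: the log2-probability that a model fine-tuned on the real inputs assigns to the gold label, minus the log2-probability assigned by a model fine-tuned with the input replaced by an empty (null) input. The function names and the example probabilities are hypothetical and are not taken from the paper.

```python
import math
from typing import Iterable, Tuple


def pvi(p_with_input: float, p_null_input: float) -> float:
    """Pointwise V-information in bits for one example.

    p_with_input: probability of the gold label from the model that sees the text.
    p_null_input: probability of the gold label from the model that sees a null input.
    Both values are assumed to come from separately fine-tuned models.
    """
    for p in (p_with_input, p_null_input):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")
    return math.log2(p_with_input) - math.log2(p_null_input)


def mean_pvi(prob_pairs: Iterable[Tuple[float, float]]) -> float:
    """Average PVI over a dataset; a higher value means the data is easier for the model."""
    scores = [pvi(p, q) for p, q in prob_pairs]
    if not scores:
        raise ValueError("need at least one example")
    return sum(scores) / len(scores)


# Hypothetical gold-label probabilities: (with input, null input).
example_pairs = [(1.0, 0.5), (0.25, 0.5)]
average = mean_pvi(example_pairs)  # one easy example (+1 bit), one hard example (-1 bit)
```

A positive PVI means the text made the gold label more predictable than the label prior alone, while a negative PVI flags an example the model finds misleading, which could for instance happen for a CAD whose label was not actually flipped.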
### Conclusions

- **Advantages of Manual CADs** Manually generated CADs are still the most effective, especially in improving model performance on OOD datasets. However, ChatGPT-generated CADs also show good results, approaching the level of manually generated CADs.
- **Challenges of Automatic CADs** The main problem with automatic CADs is that the generated changes are often insufficient, so the label flip fails. Further manual verification is therefore needed to ensure the label accuracy of CADs.
- **Mixed Use of CADs** Mixing manual and automatic CADs can further improve model performance on OOD datasets, especially in sexism detection tasks.

### Experimental Setup

- **Datasets** The study used multiple in-domain (ID) and out-of-domain (OOD) datasets, covering sexism and hate speech.
- **Model Architectures** RoBERTa, Flan-T5, and SVM models were used in the experiments, and the effects of different types of CADs on model performance were compared.
- **Evaluation Metrics** The macro F1 score was used as the main metric for the overall performance of each model.

### Dataset Difficulty Analysis

- **PVI Score** Using PVI scores, the study found that training data containing CADs yields a higher average PVI score on OOD datasets, which indicates that CADs help reduce the difficulty of OOD datasets, making them easier for the model to learn.

### Summary

This study compared the effectiveness of manually and automatically generated CADs for harmful language detection. The results show that although manually generated CADs are still the most effective, automatic CADs (especially ChatGPT-generated CADs) also show potential and can improve the generalization ability of models to some extent. Future research can further optimize automatic CAD generation methods to improve their effectiveness in practical applications.
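The macro F1 metric named in the experimental setup can be sketched as follows. This is a plain-Python illustration of the usual definition (the unweighted mean of per-class F1 scores), not the paper's evaluation code; the function name and the toy labels are hypothetical.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred), key=repr)
    if not labels:
        raise ValueError("need at least one example")
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example: 1 = harmful, 0 = not harmful (hypothetical labels).
gold = [1, 1, 1, 0]
predicted = [1, 1, 0, 0]
score = macro_f1(gold, predicted)  # per-class F1 is 0.8 for class 1 and 2/3 for class 0
```

Because every class contributes equally regardless of its frequency, macro F1 penalizes a classifier that ignores the minority class, which matters for harmful language datasets where the harmful class is typically rare.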

People Make Better Edits: Measuring the Efficacy of LLM-Generated Counterfactually Augmented Data for Harmful Language Detection

How Does Counterfactually Augmented Data Impact Models for Social Computing Constructs?

A Target-Aware Analysis of Data Augmentation for Hate Speech Detection

Improving Classifier Robustness through Active Generation of Pairwise Counterfactuals

Explaining The Efficacy of Counterfactually Augmented Data

Improving the Out-Of-Distribution Generalization Capability of Language Models: Counterfactually-Augmented Data is not Enough

Unlock the Potential of Counterfactually-Augmented Data in Out-Of-Distribution Generalization

Weigh Your Own Words: Improving Hate Speech Counter Narrative Generation via Attention Regularization

LLMs for Generating and Evaluating Counterfactuals: A Comprehensive Study

Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis

A Comprehensive Study on NLP Data Augmentation for Hate Speech Detection: Legacy Methods, BERT, and LLMs

Supporting Human Raters with the Detection of Harmful Content using Large Language Models

Enhanced Offensive Language Detection Through Data Augmentation

Unmasking the Imposters: How Censorship and Domain Adaptation Affect the Detection of Machine-Generated Tweets

Leveraging Weakly Annotated Data for Hate Speech Detection in Code-Mixed Hinglish: A Feasibility-Driven Transfer Learning Approach with Large Language Models

When in Doubt, Cascade: Towards Building Efficient and Capable Guardrails

MisinfoEval: Generative AI in the Era of "Alternative Facts"

Generative AI for Hate Speech Detection: Evaluation and Findings

Keeping Humans in the Loop: Human-Centered Automated Annotation with Generative AI

AutoCAD: Automatically Generating Counterfactuals for Mitigating Shortcut Learning

Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews