People Make Better Edits: Measuring the Efficacy of LLM-Generated Counterfactually Augmented Data for Harmful Language Detection

Indira Sen, Dennis Assenmacher, Mattia Samory, Isabelle Augenstein, Wil van der Aalst, Claudia Wagner
2024-02-25
Abstract: NLP models are used in a variety of critical social computing tasks, such as detecting sexist, racist, or otherwise hateful content. Therefore, it is imperative that these models are robust to spurious features. Past work has attempted to tackle such spurious features using training data augmentation, including Counterfactually Augmented Data (CADs). CADs introduce minimal changes to existing training data points and flip their labels; training on them may reduce model dependency on spurious features. However, manually generating CADs can be time-consuming and expensive. Hence in this work, we assess if this task can be automated using generative NLP models. We automatically generate CADs using Polyjuice, ChatGPT, and Flan-T5, and evaluate their usefulness in improving model robustness compared to manually-generated CADs. By testing both model performance on multiple out-of-domain test sets and individual data point efficacy, our results show that while manual CADs are still the most effective, CADs generated by ChatGPT come a close second. One key reason for the lower performance of automated methods is that the changes they introduce are often insufficient to flip the original label.
Computation and Language, Computers and Society
What problem does this paper attempt to address?
The paper evaluates how effective automatically generated counterfactually augmented data (CADs) are for harmful language detection, compared with manually generated CADs. Specifically, the research addresses two questions:

1. **RQ1: Capabilities of different automatic CAD generation methods.** The study examines how well different automatic methods (Polyjuice, ChatGPT, and Flan-T5) generate effective CADs, and whether those CADs improve model performance. Manually generated CADs are still the most effective, with ChatGPT-generated CADs a close second. A key problem with automatic methods is that the changes they introduce are often not sufficient to flip the original label, so the generated CADs end up with inaccurate labels.
2. **RQ2: Attributes of effective CADs.** The study explores which attributes make CADs effective training data, including minimality (Levenshtein edit distance), semantic similarity, and edit type (for example, adding or deleting negation words, gender or identity words, and affect words). Automatic CADs differ significantly from manual CADs in edit distance and semantic similarity, and these differences affect how effective the CADs are.

### Main Findings

- **Model Performance:** Models trained with manually generated CADs perform best on out-of-domain (OOD) datasets. ChatGPT-generated CADs are second best, while Polyjuice- and Flan-T5-generated CADs perform poorly. This indicates that manually generated CADs are more effective at improving model generalization.
- **CAD Attribute Analysis:** Using pointwise V-information (PVI) scores, the study found that models trained on data containing CADs yield a higher average PVI on OOD datasets. A higher PVI means an instance is easier for the model, so CADs reduce the difficulty of the OOD datasets.

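The PVI score mentioned in the findings above can be illustrated with a minimal sketch. It assumes the usual definition of PVI in bits, namely the log-probability a model assigns to the gold label when it sees the input minus the log-probability assigned by a second model trained on empty (null) inputs. The function names and the probability values are hypothetical and are not taken from the paper's code; in practice the two probabilities would come from two fine-tuned classifiers.

```python
import math
from typing import Iterable, Tuple


def pvi(p_with_input: float, p_null_input: float) -> float:
    """Pointwise V-information (in bits) for one instance.

    p_with_input: probability of the gold label from a model that sees the text.
    p_null_input: probability of the gold label from a model trained on empty input.
    Both arguments are assumed to be precomputed elsewhere.
    """
    for p in (p_with_input, p_null_input):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")
    return math.log2(p_with_input) - math.log2(p_null_input)


def mean_pvi(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average PVI over (p_with_input, p_null_input) pairs for a dataset."""
    scores = [pvi(p_x, p_null) for p_x, p_null in pairs]
    if not scores:
        raise ValueError("need at least one instance")
    return sum(scores) / len(scores)


# Hypothetical example: the text-aware model is less confident in the gold
# label than the null-input baseline, so the instance has negative PVI.
print(pvi(0.25, 0.5))  # -1.0
```

A higher dataset-level average from `mean_pvi` corresponds to the "easier for the model" reading used in the summary; negative values flag instances the model gets wrong more confidently than a label-prior baseline would.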
### Conclusions

- **Advantages of Manual CADs:** Manually generated CADs remain the most effective, especially for improving model performance on OOD datasets. ChatGPT-generated CADs also perform well and approach the level of manual CADs.
- **Challenges of Automatic CADs:** The main problem with automatic CADs is that the generated edits are often too small to flip the label. Manual verification is therefore still needed to ensure the labels of the CADs are accurate.
- **Mixed Use of CADs:** Combining manual and automatic CADs can further improve model performance on OOD datasets, especially for sexism detection.

### Experimental Setup

- **Datasets:** The study used multiple in-domain (ID) and out-of-domain (OOD) datasets covering sexism and hate speech.
- **Model Architectures:** RoBERTa, Flan-T5, and SVM models were used to compare the effects of different types of CADs on model performance.
- **Evaluation Metrics:** Macro F1 was the main metric for overall model performance.

### Dataset Difficulty Analysis

- **PVI Score:** Models trained on data containing CADs yield a higher average PVI on OOD datasets, which indicates that CADs reduce the difficulty of those datasets for the model.

### Summary

This study compared manually and automatically generated CADs for harmful language detection. Manually generated CADs are still the most effective, but automatic CADs (especially those generated by ChatGPT) show potential and can improve model generalization to some extent. Future research can refine automatic CAD generation methods to make them more effective in practical applications.
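
The macro F1 metric named in the experimental setup averages per-class F1 scores with equal weight, so the minority (harmful) class counts as much as the majority class. The following is a minimal pure-Python sketch of that computation; the function name and the toy labels are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = set(y_true) | set(y_pred)
    if not labels:
        raise ValueError("need at least one example")

    per_class = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); every label here appears at least
        # once in y_true or y_pred, so denom is never zero.
        per_class.append(2 * tp / denom)
    return sum(per_class) / len(per_class)


# Toy labels (hypothetical): 1 = harmful, 0 = not harmful.
gold = [1, 1, 1, 0]
pred = [1, 1, 0, 0]
print(round(macro_f1(gold, pred), 4))  # 0.7333
```

In the toy example the per-class scores are 0.8 for class 1 and about 0.667 for class 0, and macro F1 is their plain average.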