Abstract: This paper examines data augmentation techniques for the detection of antisocial behavior. Data augmentation is frequently used to compensate for a lack of data or for imbalanced classes: it generates artificial samples that enlarge the training set or balance the target distribution. In the antisocial behavior detection domain, both issues arise frequently, namely a lack of quality labeled data and class imbalance. As the majority of the data in this domain is textual, augmentation methods suitable for NLP tasks must be considered. Easy data augmentation (EDA) is a group of such methods that use simple text transformations to create new, artificial samples. Our main motivation is to explore the usability of EDA techniques on selected tasks from the antisocial behavior detection domain. We focus on the class imbalance problem and apply EDA techniques to two problems: fake news classification and toxic comments classification. In both cases, we train a convolutional neural network classifier and compare its performance on the original and EDA-extended datasets. EDA techniques prove to be very task-dependent, with limitations that stem from the data they are applied to. On the extended toxic comments dataset, the model's performance improved only marginally: F1 gained just 0.01 when only a subset of EDA methods was applied. In this case, EDA techniques were not well suited to texts written in more informal language. On the fake news dataset, by contrast, performance improved more substantially, with the F1 score rising by 0.1. The improvement was largest for the minority class, where F1 rose from 0.67 to 0.86.
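To make the idea of balancing classes with simple text transformations concrete, the following is a minimal Python sketch. It is an illustration only, not the authors' implementation: the abstract does not say which transformations or parameters were used, so the two operations shown here (random word swap and random word deletion, both commonly associated with EDA), the deletion probability, and all function names are assumptions.

```python
import random


def random_swap(words, n_swaps, rng):
    """Return a copy of `words` with `n_swaps` random pairs of positions swapped."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word with probability `p`, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def oversample_minority(texts, labels, minority_label, seed=0):
    """Add augmented minority samples until the minority class matches the largest class.

    Hypothetical helper: the choice of operations and p=0.1 are assumptions.
    """
    rng = random.Random(seed)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    minority = [t for t, y in zip(texts, labels) if y == minority_label]
    new_texts, new_labels = list(texts), list(labels)
    if not minority:
        return new_texts, new_labels
    needed = max(counts.values()) - counts[minority_label]
    for k in range(needed):
        # Cycle through the original minority samples and perturb each one.
        words = minority[k % len(minority)].split()
        if rng.choice(("swap", "delete")) == "swap":
            augmented = random_swap(words, 1, rng)
        else:
            augmented = random_deletion(words, 0.1, rng)
        new_texts.append(" ".join(augmented))
        new_labels.append(minority_label)
    return new_texts, new_labels
```

A classifier would then be trained once on the original `(texts, labels)` and once on the output of `oversample_minority`, and the two F1 scores compared. Only the training split should be augmented, so that evaluation still reflects the original data.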

Use of Data Augmentation Techniques in Detection of Antisocial Behavior Using Deep Learning Methods
