Detecting Aggression in Language: From Diverse Data to Robust Classifiers

Aleksander Wawer, Agnieszka Mykowiecka, Bartosz Żuk
DOI: https://doi.org/10.3390/electronics13244857
IF: 2.9
2024-12-11
Electronics
Abstract: The automatic detection of aggressive language is a difficult challenge. Currently, three datasets are available in Polish, enabling the training of machine learning models to recognise different types of linguistic aggression. In this paper, we address two issues: the transferability of knowledge between datasets, and the training of a single model that works best on all types of aggression. Because the data are imbalanced, we experiment with two loss functions designed for training on imbalanced data: Weighted Cross-Entropy and Focal Loss. Using the Polish-language HerBERT model, we present the results of experiments in the cross-dataset scenario and the results of the model trained on the combined data. Our results show that (1) combining diverse types of linguistic aggression during training leads to a better-performing classifier and (2) Weighted Cross-Entropy outperforms the other tested loss functions.
Engineering, Electrical & Electronic; Computer Science, Information Systems; Physics, Applied