Abusive Comment Detection in Tamil Code-Mixed Data by Adjusting Class Weights and Refining Features

Gayathri G L, Krithika Swaminathan, Divyasri Krishnakumar, Thenmozhi D, Bharathi B
DOI: https://doi.org/10.1145/3664619
IF: 1.471
2024-05-18
ACM Transactions on Asian and Low-Resource Language Information Processing
Abstract: In recent years, a significant portion of the content on internet platforms has been found to be offensive or abusive. Abusive comment detection can go a long way towards shielding internet users from the adverse effects of exposure to abusive language. The problem is particularly challenging when the comments are written in low-resource languages such as Tamil, or in Tamil-English code-mixed text. So far, there has been no substantial work on abusive comment detection using imbalanced datasets. Furthermore, no significant work, especially for Tamil code-mixed data, has analysed the dataset for classification and built a custom preprocessing vocabulary accordingly. This paper proposes a novel approach to classifying abusive comments from an imbalanced dataset using a customised training vocabulary and a combination of statistical feature selection with language-agnostic feature selection, while making use of explainable AI for feature refinement. Our model achieved an accuracy of 74% and a macro F1-score of 0.46.
computer science, artificial intelligence
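The gap between the reported accuracy and macro F1-score is typical of imbalanced data, and class weighting is one common way to counter it. The sketch below is illustrative only and is not the authors' implementation: it assumes the widely used "balanced" heuristic (weight = n_samples / (n_classes * class_count)), and the label names and toy predictions are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive larger weights (assumed heuristic, not
    necessarily the weighting scheme used in the paper).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: 8 non-abusive comments, 2 abusive ones,
# and a classifier that always predicts the majority class.
y_true = ["none"] * 8 + ["abusive"] * 2
y_pred = ["none"] * 10

weights = balanced_class_weights(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred)

print(weights)   # the minority class gets the larger weight
print(accuracy)  # looks good despite missing every abusive comment
print(macro)     # penalises the missed minority class
```

In this toy case the majority-class predictor looks accurate while its macro F1 is far lower, which is why macro F1 is usually the more informative figure for imbalanced abuse-detection datasets.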