User-Aware Multilingual Abusive Content Detection in Social Media

Mohammad Zia Ur Rehman, Somya Mehta, Kuldeep Singh, Kunal Kaushik, Nagendra Kumar
DOI: https://doi.org/10.1016/j.ipm.2023.103450
2024-10-26
Abstract: Despite growing efforts to halt distasteful content on social media, multilingualism has added a new dimension to this problem. The scarcity of resources makes the challenge even greater when it comes to low-resource languages. This work focuses on providing a novel method for abusive content detection in multiple low-resource Indic languages. Our observation indicates that a post's tendency to attract abusive comments, as well as features such as user history and social context, significantly aid in the detection of abusive content. The proposed method first learns social and text context features in two separate modules. The integrated representation from these modules is learned and used for the final prediction. To evaluate the performance of our method against different classical and state-of-the-art methods, we have performed extensive experiments on SCIDN and MACI datasets consisting of 1.5M and 665K multilingual comments, respectively. Our proposed method outperforms state-of-the-art baseline methods with an average increase of 4.08% and 9.52% in F1-scores on SCIDN and MACI datasets, respectively.
Social and Information Networks, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses the detection of abusive content in low-resource Indic languages on multilingual social media. Specifically, it focuses on the following challenges:

1. **Inconsistent spellings**: Many of these languages are written in the Roman alphabet without a standard spelling, so the same word appears in many forms. This variation, which is especially pronounced in low-resource languages, makes abusive content harder to detect.
2. **Different syntactic structures of code-mixed languages**: Social media users often mix several languages within a single post. The grammatical structure of such code-mixed text varies from user to user, which complicates the identification of abusive content.
3. **Over-reliance on text features**: Many existing studies focus mainly on text features and ignore social context features (such as the number of likes and reports), which carry useful information about the nature of the content.
4. **Widespread multilingual use by Indian social media users**: India is a multilingual country, and its social media users frequently write in multiple languages, especially low-resource Indic languages. An effective method is therefore needed to detect abusive content in these languages.

To address these challenges, the paper proposes a method that combines social context features and text features. The two feature types are learned in two separate modules, and their outputs are integrated into a high-level feature representation that is used for the final prediction. The paper also proposes a Transformer-based method with cross-lingual training for extracting contextual embeddings of user-generated content. Experimental results show that the method outperforms existing state-of-the-art baselines on the SCIDN and MACI datasets, with average F1-score increases of 4.08% and 9.52%, respectively.
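The two-module design described above can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not specify layer types, sizes, or how the two representations are integrated, so the single dense layer per branch, the fusion by concatenation, the class name `TwoBranchClassifier`, and the example feature values are all assumptions made for illustration. The weights are random and untrained, so only the data flow is meaningful. In practice the text vector would come from a Transformer encoder; here it is a placeholder list of floats.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense_relu(inputs, weights, biases):
    """Apply one fully connected layer followed by ReLU."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


class TwoBranchClassifier:
    """Hypothetical two-branch model: one branch per feature type, fused by concatenation."""

    def __init__(self, text_dim, social_dim, hidden=4, seed=0):
        rng = random.Random(seed)
        self.text_layer = init_layer(text_dim, hidden, rng)      # text module
        self.social_layer = init_layer(social_dim, hidden, rng)  # social-context module
        self.out_layer = init_layer(2 * hidden, 1, rng)          # classifier on fused vector

    def fuse(self, text_vec, social_vec):
        """Encode each input separately, then concatenate the two representations."""
        text_repr = dense_relu(text_vec, *self.text_layer)
        social_repr = dense_relu(social_vec, *self.social_layer)
        return text_repr + social_repr  # list concatenation (assumed fusion strategy)

    def predict_proba(self, text_vec, social_vec):
        """Return an (untrained) probability that the comment is abusive."""
        fused = self.fuse(text_vec, social_vec)
        weights, biases = self.out_layer
        logit = sum(w * x for w, x in zip(weights[0], fused)) + biases[0]
        return 1.0 / (1.0 + math.exp(-logit))


# Placeholder inputs: a 3-dimensional "text embedding" and two social
# features (scaled like and report counts). Values are made up.
model = TwoBranchClassifier(text_dim=3, social_dim=2)
probability = model.predict_proba([0.1, 0.2, 0.3], [0.5, 0.1])
print(f"abusive probability (untrained): {probability:.3f}")
```

Keeping the branches separate until the fusion step means each module can be sized or swapped independently, for example replacing the text branch with a multilingual Transformer encoder while leaving the social-context branch unchanged.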