Abstract: The rise of social media has opened the way for online harassment in the form of hate speech and offensive language. An automated approach to detecting hate and offensive content on social media is therefore indispensable. The task is especially challenging for posts and comments written in low-resource CodeMix languages. This paper investigates the efficacy of various multilingual transformer-based embedding models, combined with machine learning classifiers, for detecting hate speech and offensive language (HOS) in social media posts written in CodeMix Dravidian languages, which belong to the low-resource language group. Experiments were conducted on six openly available datasets in Kannada-English, Malayalam-English, and Tamil-English. The objective is to identify a single pre-trained embedding model that works well for HOS detection across all of these languages. To this end, we conducted a comprehensive study of multilingual transformer embedding models such as BERT, DistilBERT, LaBSE, MuRIL, XLM, IndicBERT, and FNET. Our experiments revealed that MuRIL pre-trained embeddings performed consistently well on all six datasets when used with a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel. The highest accuracy obtained on each dataset was as follows: on DravidianLangTech 2021, 96% for Malayalam, 72% for Tamil, and 66% for Kannada; on HASOC 2021, 76% for Tamil and 68% for Malayalam; and on HASOC 2020, 92% for Malayalam. We also performed an in-depth error analysis and a comparative study, presenting a tabulated summary of our work alongside other top-performing studies.
In addition, we employed a cost-sensitive learning approach to address class imbalance in the datasets, in which minority classes receive higher classification weights than majority classes. The weights were initialized and then fine-tuned to obtain the best balance across all classes. The results showed that incorporating the cost-sensitive learning strategy avoided class bias in the trained model. A further significant contribution of this paper is a novel annotated test set for Malayalam-English CodeMix. This new dataset extends our existing Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages (HASOC) 2021 Malayalam-English dataset.
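
The cost-sensitive idea described above can be illustrated with a minimal sketch. The abstract does not state how the weights were initialized or tuned, so the inverse-frequency formula, the helper names `balanced_class_weights` and `scale_weight`, and the toy labels below are all assumptions made for illustration, not the authors' procedure.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumed initialization heuristic; rarer classes get larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def scale_weight(weights, cls, factor):
    """Return a copy of `weights` with one class weight multiplied by `factor`.

    Stands in for a manual fine-tuning step (hypothetical helper).
    """
    tuned = dict(weights)
    tuned[cls] *= factor
    return tuned


# Hypothetical toy labels: 8 non-offensive ("NOT") and 2 offensive ("OFF") comments.
labels = ["NOT"] * 8 + ["OFF"] * 2
weights = balanced_class_weights(labels)

# Manually raise the minority-class weight further, as one possible tuning step.
tuned = scale_weight(weights, "OFF", 2.0)
print(weights, tuned)
```

A dictionary of this shape could be passed to any classifier that accepts per-class weights, for example the `class_weight` argument of scikit-learn's `SVC(kernel="rbf")`. Whether the experiments used that library is not stated in the abstract.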

Abusive Comment Detection in Tamil Code-Mixed Data by Adjusting Class Weights and Refining Features

Optimize_Prime@DravidianLangTech-ACL2022: Abusive Comment Detection in Tamil

Abusive Language Detection in Online User Content

Investigating Bias In Automatic Toxic Comment Detection: An Empirical Study

Abusive Language Detection in Heterogeneous Contexts: Dataset Collection and the Role of Supervised Attention

Large scale annotated dataset for code-mix abusive short noisy text

Towards Offensive Language Identification for Tamil Code-Mixed YouTube Comments and Posts

Detection of Homophobia & Transphobia in Dravidian Languages: Exploring Deep Learning Methods

User-Aware Multilingual Abusive Content Detection in Social Media

Deep learning based sentiment analysis and offensive language identification on multilingual code-mixed data

Development of an Efficient Method to Detect Mixed Social Media Data with Tamil-English Code Using Machine Learning Techniques

DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text

Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages

Abusive Bangla comments detection on Facebook using transformer-based deep learning models

Breaking the Silence Detecting and Mitigating Gendered Abuse in Hindi, Tamil, and Indian English Online Spaces

An Automated Toxicity Classification on Social Media Using LSTM and Word Embedding

Toxicity Detection for Indic Multilingual Social Media Content

Detection of Hate Speech and Offensive Language CodeMix Text in Dravidian Languages Using Cost-Sensitive Learning Approach

Purging the Poison: A Machine Learning Approach to Filtering Toxic Comments

YouTube Comments Decoded: Leveraging LLMs for Low Resource Language Classification

Exploratory Data Analysis on Code-mixed Misogynistic Comments