Abstract: The rise of social media has opened the way for online harassment in the form of hate speech and offensive language, making automated detection of such content indispensable. The task is especially challenging for posts and comments written in low-resource CodeMix languages. This paper investigates the efficacy of various multilingual transformer-based embedding models, combined with machine learning classifiers, for detecting hate speech and offensive language (HOS) in social media posts in CodeMix Dravidian languages, which belong to the low-resource language group. Experiments were conducted on six openly available datasets in Kannada-English, Malayalam-English, and Tamil-English. The objective is to identify a single pre-trained embedding model that works well for HOS detection across all of these languages. To this end, we conducted a comprehensive study of multilingual transformer embedding models, including BERT, DistilBERT, LaBSE, MuRIL, XLM, IndicBERT, and FNET. Our experiments revealed that MuRIL pre-trained embeddings performed consistently well on all six datasets when used with a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel. The highest accuracy obtained on each dataset was as follows: on DravidianLangTech 2021, 96% for Malayalam, 72% for Tamil, and 66% for Kannada; on HASOC 2021, 76% for Tamil and 68% for Malayalam; and on HASOC 2020, 92% for Malayalam. We also performed an in-depth error analysis and a comparative study, presenting a tabulated summary of our work against other top-performing studies.
In addition, we employed a cost-sensitive learning approach to address the class imbalance in the data, in which minority classes receive higher classification weights than majority classes. The weights were initialized and then fine-tuned to obtain the best balance across all classes. The results showed that the cost-sensitive learning strategy avoided class bias in the trained model. A further significant contribution of this paper is a novel annotated test set for Malayalam-English CodeMix, which extends our existing data, the Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages (HASOC) 2021 Malayalam-English dataset.
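
To make the cost-sensitive idea concrete, the following minimal sketch shows one common way to initialize class weights so that minority classes are weighted more heavily. It assumes the inverse-frequency heuristic (weight = n_samples / (n_classes * class_count)); the abstract does not state which initialization or fine-tuning procedure was actually used, and the function name, label names, and class counts below are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using the inverse-frequency heuristic.

    weight(c) = n_samples / (n_classes * count(c)), so rarer classes
    receive larger weights. This is an assumed starting point, not
    necessarily the initialization used in the paper.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced label set: 80 non-offensive, 20 offensive posts.
labels = ["NOT"] * 80 + ["OFF"] * 20
weights = balanced_class_weights(labels)
print(weights)
```

A dictionary of this shape can be passed as the `class_weight` argument of scikit-learn's `SVC(kernel="rbf", ...)`, after which the individual weights could be tuned further, for example on a validation set.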

Author Profiling in Code-Mixed WhatsApp Messages Using Stacked Convolution Networks and Contextualized Embedding Based Text Augmentation

Gender Prediction in English-Hindi Code-Mixed Social Media Content : Corpus and Baseline System

BN-AuthProf: Benchmarking Machine Learning for Bangla Author Profiling on Social Media Texts

Exploratory Data Analysis of WhatsApp group chat

A Word Embeddings based Approach for Author Profiling: Gender and Age Prediction

A Study of WhatsApp Usage Patterns and Prediction Models without Message Content

SMPOST: Parts of Speech Tagger for Code-Mixed Indic Social Media Text

Linguistic Analysis of Hindi-English Mixed Tweets for Depression Detection

Author Identification from Literary Articles with Visual Features: A Case Study with Bangla Documents

Breaking the Silence Detecting and Mitigating Gendered Abuse in Hindi, Tamil, and Indian English Online Spaces

Integrating Bidirectional Long Short-Term Memory with Subword Embedding for Authorship Attribution

Authorship Identification in Bengali Literature: a Comparative Analysis

Evaluating Transformers and Linguistic Features integration for Author Profiling tasks in Spanish

Detection of Homophobia & Transphobia in Dravidian Languages: Exploring Deep Learning Methods

Deep Learning Speech Synthesis Model for Word/Character-Level Recognition in the Tamil Language

User profiling using smartphone network traffic analysis

Detection of Hate Speech and Offensive Language CodeMix Text in Dravidian Languages Using Cost-Sensitive Learning Approach

Characterising User Content on a Multi-lingual Social Network

Author Identity Unveiled: Gender and Age Prediction from Textual Patterns using BERT

IIITT@Dravidian-CodeMix-FIRE2021: Transliterate or translate? Sentiment analysis of code-mixed text in Dravidian languages

NUIG-Shubhanker@Dravidian-CodeMix-FIRE2020: Sentiment Analysis of Code-Mixed Dravidian text using XLNet