Combining multiple pre-trained models for hate speech detection in Bengali, Marathi, and Hindi
Arpan Nandi, Kamal Sarkar, Arjun Mallick, Arkadeep De
DOI: https://doi.org/10.1007/s11042-023-17934-x
IF: 2.577
2024-02-28
Multimedia Tools and Applications
Abstract: With the growing use of regional languages on social media platforms, hate speech detection in these languages has attracted the attention of researchers. Hundreds of languages are spoken in India, in forms that vary with geography, culture, and other factors. The number of active internet users in India has been rising rapidly, and social media has consequently reached the general Indian population. Although the need for accurate detection and timely removal of abusive or offensive text has increased, well-organized, labeled data for Indian languages are scarce, and almost all regional languages in India are low-resource languages. Hence, the objective of this study is to develop an approach that learns from relatively small volumes of Indian-language data and provides state-of-the-art results. This study proposes a fusion of features extracted from a fine-tuned multilingual BERT (Bidirectional Encoder Representations from Transformers) and a fine-tuned Indic BERT. Since the BERT models used in this work are pre-trained on a large volume of text in multiple Indian languages, transfer learning addresses the problem of low training data volume and makes the proposed model more generic. Three datasets in three different Indian languages, namely Bengali, Marathi, and Hindi, were used to evaluate the proposed approach. The proposed model achieved weighted F1 scores of 0.923, 0.815, and 0.924 on the Bengali, Hindi, and Marathi datasets, respectively. On the Bengali and Marathi datasets, the obtained results are better than the existing best results.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
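The abstract reports results as weighted F1 scores. As a reading aid, the following is a minimal sketch of how that metric is commonly computed: the per-class F1 is averaged with each class weighted by its share of the true labels. The function name `weighted_f1` and the labels `"hate"` and `"not"` are hypothetical and chosen only for illustration; this is not the authors' evaluation code, and the abstract does not specify their label sets.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores.

    Illustrative sketch only; the name and signature are assumptions,
    not taken from the paper.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("y_true must not be empty")

    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0

    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)

        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        f1 = (2 * tp / denom) if denom else 0.0

        # Weight each class by its share of the true labels.
        score += (count / total) * f1

    return score


# Hypothetical toy labels, not data from the paper.
y_true = ["hate", "hate", "not", "not", "not"]
y_pred = ["hate", "not", "not", "not", "hate"]
example_score = weighted_f1(y_true, y_pred)
```

Because classes are weighted by support, the score on an imbalanced dataset is dominated by the majority class, which is worth keeping in mind when comparing the three per-language figures.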