Abstract: As regional languages are increasingly used on social media platforms, hate speech detection in these languages has attracted the attention of researchers. Hundreds of languages are spoken in India, in forms that vary with geography, culture, and other factors. The number of active internet users in India has been growing rapidly, and social media has consequently reached the general Indian population. Although the need for accurate detection and timely removal of abusive or offensive text has increased, well-organized, labeled data for Indian languages remain scarce, and almost all regional languages in India are low-resource languages. The objective of this study is therefore to develop an approach that learns from relatively small volumes of Indian language data and provides state-of-the-art results. This study proposes a fusion of features extracted from a fine-tuned multilingual BERT (Bidirectional Encoder Representations from Transformers) and a fine-tuned Indic BERT. Because the BERT models used in this work are pre-trained on a large volume of text in multiple Indian languages, transfer learning addresses the problem of low training data volume and makes the proposed model more generic. Three datasets in three different Indian languages, namely Bengali, Marathi, and Hindi, are used to evaluate the proposed approach. The proposed model achieved weighted F1 scores of 0.923, 0.815, and 0.924 on the Bengali, Hindi, and Marathi datasets, respectively. On the Bengali and Marathi datasets, these results are better than the best previously reported results.
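
The two technical ideas in the abstract, feature fusion and the weighted F1 score, can be illustrated with a minimal Python sketch. The abstract does not say how the features from the two models are fused, so plain concatenation is assumed here; the function names `fuse_features` and `weighted_f1`, the toy vectors, and the toy labels are all hypothetical and are not taken from the study. The weighted F1 score follows the standard definition: the per-class F1 scores averaged with weights proportional to each class's number of true instances.

```python
from collections import Counter


def fuse_features(mbert_vec, indic_vec):
    """Fuse two sentence-level feature vectors.

    Assumption: fusion is plain concatenation. The abstract does not
    specify the fusion strategy actually used in the study.
    """
    return list(mbert_vec) + list(indic_vec)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (count / total) * f1
    return score


# Toy stand-ins for features from the two fine-tuned encoders.
fused = fuse_features([0.1, 0.2], [0.3, 0.4, 0.5])

# Toy labels, for illustration only.
y_true = ["hate", "hate", "hate", "not"]
y_pred = ["hate", "hate", "not", "not"]
print(len(fused), round(weighted_f1(y_true, y_pred), 3))
```

In a real pipeline the fused vector would feed a classification layer, and the metric would be computed on a held-out test split; both steps are omitted from this sketch.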

L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages

L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi

L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT

L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models

Comparative Study of Pre-Trained BERT Models for Code-Mixed Hindi-English Data

L3Cube-MahaNews: News-based Short Text and Long Document Classification Datasets in Marathi

L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models

Mono vs Multilingual BERT: A Case Study in Hindi and Marathi Named Entity Recognition

Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi

L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages

Indic-Transformers: An Analysis of Transformer Language Models for Indian Languages

My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks

IndicIRSuite: Multilingual Dataset and Neural Information Models for Indian Languages

Development of Pre-Trained Transformer-based Models for the Nepali Language

IndicBART: A Pre-trained Model for Indic Natural Language Generation

San-BERT: Extractive Summarization for Sanskrit Documents using BERT and it's variants

NLPineers@ NLU of Devanagari Script Languages 2025: Hate Speech Detection using Ensembling of BERT-based models

Combining multiple pre-trained models for hate speech detection in Bengali, Marathi, and Hindi

L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT models

Language Identification of Hindi-English tweets using code-mixed BERT

BERT Based Multilingual Machine Comprehension in English and Hindi