Abstract: The exponential growth of social media and micro-blogging sites not only provides platforms for empowering freedom of expression and individual voices but also enables people to express anti-social behaviour such as online harassment, cyberbullying, and hate speech. Numerous works have been proposed to utilize these data for social and anti-social behaviour analysis, document characterization, and sentiment analysis by predicting the contexts, mostly for highly resourced languages such as English. However, some languages are under-resourced, e.g., South Asian languages like Bengali, Tamil, Assamese, and Telugu, which lack computational resources for NLP tasks. In this paper, we provide several classification benchmarks for Bengali, an under-resourced language. We prepared three datasets of hateful expressions, commonly used topics, and opinions for hate speech detection, document classification, and sentiment analysis, respectively. We built the largest Bengali word embedding models to date, based on 250 million articles, which we call BengFastText. We perform three different experiments, covering document classification, sentiment analysis, and hate speech detection. We incorporate word embeddings into a Multichannel Convolutional-LSTM (MConv-LSTM) network for predicting different types of hate speech, document classification, and sentiment analysis. Experiments demonstrate that BengFastText can capture the semantics of words from their respective contexts correctly. Evaluations against several baseline embedding models, e.g., Word2Vec and GloVe, yield up to 92.30%, 82.25%, and 90.45% F1-scores for document classification, sentiment analysis, and hate speech detection, respectively, during 5-fold cross-validation tests.
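
As an illustration of the evaluation protocol mentioned in the abstract (F1-scores from 5-fold cross-validation), the following is a minimal sketch in plain Python. It is not the authors' code. The abstract does not say whether the folds are stratified or how the F1-score is averaged, so the stratified split and the macro average are assumptions. The names `stratified_folds`, `macro_f1`, `cross_validate`, `fit_majority`, and `predict_majority` are hypothetical, and the majority-class baseline only stands in for a real classifier such as the MConv-LSTM network.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split example indices into k folds with similar class proportions.

    Assumption: stratification is our choice; the abstract only says "5-fold".
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's examples round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cross_validate(texts, labels, fit, predict, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, repeat k times."""
    folds = stratified_folds(labels, k, seed)
    fold_scores = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        model = fit([texts[i] for i in train_idx], [labels[i] for i in train_idx])
        preds = predict(model, [texts[i] for i in test_idx])
        fold_scores.append(macro_f1([labels[i] for i in test_idx], preds))
    return fold_scores


# Placeholder classifier (hypothetical): always predicts the most common
# training label. A real experiment would plug in an actual model here.
def fit_majority(train_texts, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, test_texts):
    return [model] * len(test_texts)
```

Calling `cross_validate(texts, labels, fit_majority, predict_majority)` returns one F1-score per fold. A reported figure would typically be the mean of those five values, although the abstract does not state how its per-fold scores were aggregated.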

BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis

SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment Analysis

Preparing Bengali-English Code-Mixed Corpus for Sentiment Analysis of Indian Languages

Analyzing Roles of Classifiers and Code-Mixed factors for Sentiment Identification

OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification

BAN-ABSA: An Aspect-Based Sentiment Analysis dataset for Bengali and it's baseline evaluation

Sentiment analysis in Bengali via transfer learning using multi-lingual BERT

A Sentiment Analysis Dataset for Code-Mixed Malayalam-English

Bengali & Banglish: A monolingual dataset for emotion detection in linguistically diverse contexts

Sentiment Analysis of Code-Mixed Languages leveraging Resource Rich Languages

SentiGOLD: A Large Bangla Gold Standard Multi-Domain Sentiment Analysis Dataset and its Evaluation

Enhancing Sentiment Analysis in Bengali Texts: A Hybrid Approach Using Lexicon-Based Algorithm and Pretrained Language Model Bangla-BERT

My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks

Sentiment Analysis of Persian-English Code-mixed Texts

Classification Benchmarks for Under-resourced Bengali Language based on Multichannel Convolutional-LSTM Network

JU_KS@SAIL_CodeMixed-2017: Sentiment Analysis for Indian Code Mixed Social Media Texts

Sentiment Analysis of Code-Mixed Social Media Text (Hinglish)

Sentiment Identification in Code-Mixed Social Media Text

A comprehensive dataset for sentiment and emotion classification from Bangladesh e-commerce reviews

CMSAOne@Dravidian-CodeMix-FIRE2020: A Meta Embedding and Transformer model for Code-Mixed Sentiment Analysis on Social Media Text

Improving code-mixed hate detection by native sample mixing: A case study for Hindi-English code-mixed scenario