Abstract: Low-resource languages are gaining much-needed attention with the advent of deep learning models and pre-trained word embeddings. Though spoken by more than 230 million people worldwide, Urdu is one such low-resource language; it has recently gained popularity online and is attracting growing attention and support from the research community. One challenge faced by such resource-constrained languages is the scarcity of publicly available large-scale datasets for conducting meaningful studies. In this paper, we address this challenge by collecting the first large-scale Urdu Tweet Dataset for sentiment analysis and emotion recognition. The dataset consists of 1,140,821 tweets in the Urdu language. Manually labeling such a large number of tweets would be tedious, error-prone, and impractical; therefore, the paper also proposes a weakly supervised approach to label tweets automatically. The approach uses the emoticons within the tweets, together with SentiWordNet, to categorize the extracted tweets as positive, negative, or neutral. Baseline deep learning models are implemented to compute the accuracy of three labeling approaches: VADER, TextBlob, and our proposed weakly supervised approach. Unlike the weakly supervised labeling approach, VADER and TextBlob label most tweets as neutral, and their labels are highly correlated with each other. This is largely attributed to the fact that these models do not consider emoticons when assigning polarity.
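
The weakly supervised labeling idea described above can be illustrated with a minimal Python sketch. The abstract does not give the actual rules, so everything here is an assumption made for illustration: the emoji table `EMOJI_POLARITY`, the tiny `LEXICON` standing in for SentiWordNet-style positive/negative scores, the `threshold` value, and the choice to let emoji evidence take priority over lexicon evidence.

```python
# Hypothetical emoji polarity table (illustrative only, not the paper's list).
EMOJI_POLARITY = {
    "😀": 1, "😍": 1, "👍": 1,
    "😢": -1, "😡": -1, "👎": -1, "💔": -1,
}

# Hypothetical stand-in for SentiWordNet-style scores: word -> (positive, negative).
LEXICON = {
    "اچھا": (0.75, 0.0),   # "good"
    "خوش": (0.625, 0.0),   # "happy"
    "برا": (0.0, 0.75),    # "bad"
    "افسوس": (0.0, 0.5),   # "sorrow"
}


def emoji_score(text):
    """Sum the polarity of every known emoji character in the text."""
    return sum(EMOJI_POLARITY.get(ch, 0) for ch in text)


def lexicon_score(text):
    """Sum (positive - negative) over whitespace-separated tokens found in the lexicon."""
    score = 0.0
    for token in text.split():
        pos, neg = LEXICON.get(token, (0.0, 0.0))
        score += pos - neg
    return score


def weak_label(text, threshold=0.25):
    """Assign a weak sentiment label.

    Assumed rule: a non-zero emoji score decides the label; otherwise
    the lexicon score is used. Scores within +/- threshold are neutral.
    """
    e = emoji_score(text)
    score = float(e) if e != 0 else lexicon_score(text)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for tweet in ["😀", "برا", "hello"]:
        print(tweet, "->", weak_label(tweet))
```

Under these assumptions, a tweet with no emoji and no lexicon hit falls back to neutral, which is the behavior the abstract reports for lexicon-only tools on most tweets.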

NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

NollySenti: Leveraging Transfer Learning and Machine Translation for Nigerian Movie Sentiment Classification

Semantic Enrichment of Nigerian Pidgin English for Contextual Sentiment Classification

HausaNLP at SemEval-2023 Task 12: Leveraging African Low Resource TweetData for Sentiment Analysis

A multilingual dataset for offensive language and hate speech detection for Hausa, Yoruba and Igbo languages

Building a Sentiment Corpus of Tweets in Brazilian Portuguese

Sentiment Analysis Across Multiple African Languages: A Current Benchmark

Sentiment Analysis of Multilingual Tweets based on Natural Language Processing (NLP)

Lexicon dataset for the Hausa language

YOSM: A new Yoruba sentiment corpus for movie reviews

Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects

Development of a General Purpose Sentiment Lexicon for Igbo Language

SentiUrdu-1M: A large-scale tweet dataset for Urdu text sentiment analysis using weakly supervised learning

Masakhane-Afrisenti at SemEval-2023 Task 12: Sentiment Analysis using Afro-centric Language Models and Adapters for Low-resource African Languages

NaijaHate: Evaluating Hate Speech Detection on Nigerian Twitter Using Representative Data

pysentimiento: A Python Toolkit for Opinion Mining and Social NLP tasks

A survey of sentiment analysis in the Portuguese language

Bambara Language Dataset for Sentiment Analysis

Urdu Speech and Text Based Sentiment Analyzer

UniSent: Universal Adaptable Sentiment Lexica for 1000+ Languages