COVID-19-related Nepali Tweets Classification in a Low Resource Setting

Rabin Adhikari, Safal Thapaliya, Nirajan Basnet, Samip Poudel, Aman Shakya, Bishesh Khanal
DOI: https://doi.org/10.48550/arXiv.2210.05425
2022-10-11
Abstract: Billions of people across the globe have been using social media platforms in their local languages to voice their opinions about various topics related to the COVID-19 pandemic. Several organizations, including the World Health Organization, have developed automated social media analysis tools that classify COVID-19-related tweets into various topics. However, these tools, which help combat the pandemic, support only a few languages, so several countries cannot benefit from them. While multi-lingual or low-resource language-specific tools are being developed, their coverage still needs to expand to languages such as Nepali. In this paper, we identify the eight most common COVID-19 discussion topics among the Twitter community using the Nepali language, set up an online platform to automatically gather Nepali tweets containing COVID-19-related keywords, classify the tweets into the eight topics, and visualize the results over time in a web-based dashboard. We compare the performance of two state-of-the-art multi-lingual language models for Nepali tweet classification, one generic (mBERT) and the other specific to the Nepali language family (MuRIL). Our results show that the models' relative performance depends on the data size, with MuRIL performing better on a larger dataset. The annotated data, models, and the web-based dashboard are open-sourced at https://github.com/naamiinepal/covid-tweet-classification.
Computation and Language, Artificial Intelligence, Computers and Society, Machine Learning