Multilingual Sentiment Analysis on Short Text Document Using Semi-Supervised Machine Learning

Joshua Lois Cruz Paulino, Lexter Carl Antoja Almirol, Jun Marco Cruz Favila, Kent Alvin Gerald Loria Aquino, Angelica Hernandez De La Cruz, Rachel Edita Roxas
DOI: https://doi.org/10.1145/3485768.3485775
2021-08-21
Abstract: Sentiment analysis is the task of identifying the sentiments expressed in text, and it is often applied to social media posts, customer feedback, and product reviews. Various studies have explored how sentiment analysis can be done automatically using machine learning techniques. However, there have been few attempts to implement sentiment analysis on multilingual text. Furthermore, most existing works use labelled data to train and develop machine learning models for sentiment analysis. Using labelled data is often expensive and time-consuming. In this study, a sentiment analysis model for multilingual text using semi-supervised machine learning was explored. The data consist of 50,788 tweets about COVID-19, which were cleaned by removing unnecessary characters, stop words, and emojis. After cleaning, the language of each tweet was identified, and all tweets not written in Filipino or English were removed from the dataset. Afterwards, the tweets were all translated into English in preparation for the annotation phase. This study used an open-source tool, TextBlob, to annotate the tweets. TextBlob outputs the polarity of the text in vector representation. The TextBlob annotations were then validated by human experts through an inter-rater agreement assessment. The human annotations and the TextBlob annotations showed substantial agreement, with a Fleiss’ kappa of 0.78. Classifier models were developed using various machine learning algorithms. Based on the results of the experiment, SVC with count vectorizer features is the best-performing model, with an accuracy, precision, recall, and F1-score of 95%. For future work, fine-tuning hyperparameters to optimize the models can be considered.
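The agreement figure reported in the abstract is a Fleiss’ kappa. As a rough illustration of how that statistic is computed, the sketch below implements the standard formula in plain Python. It is not the authors' code: the function name `fleiss_kappa`, the label names, and the toy ratings are all assumptions made for this example, and it assumes every item was rated by the same number of raters.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Compute Fleiss' kappa.

    ratings: list of items; each item is a list with one label per rater.
    categories: list of all possible labels.
    Assumes every item has the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # Count how many raters chose each category, per item.
    counts = [Counter(item) for item in ratings]

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_expected = sum(
        (sum(c[cat] for c in counts) / total) ** 2 for cat in categories
    )

    if p_expected == 1.0:
        # Only one category was ever used, so kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: three raters (e.g. two humans plus the tool's label).
labels = ["positive", "neutral", "negative"]
toy_ratings = [
    ["positive", "positive", "positive"],
    ["negative", "negative", "negative"],
    ["neutral", "neutral", "neutral"],
]
kappa = fleiss_kappa(toy_ratings, labels)
```

With full agreement on every item, as in the toy data, the statistic equals 1. Values between 0.61 and 0.80 are conventionally read as substantial agreement, which is the band the reported 0.78 falls in.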