Leveraging distant supervision and deep learning for twitter sentiment and emotion classification

Muhamet Kastrati, Zenun Kastrati, Ali Shariq Imran, Marenglen Biba
DOI: https://doi.org/10.1007/s10844-024-00845-0
2024-03-24
Journal of Intelligent Information Systems
Abstract: Nowadays, various applications across industries, healthcare, and security have begun adopting automatic sentiment analysis and emotion detection in short texts, such as posts from social media. Twitter stands out as one of the most popular online social media platforms due to its easy, unique, and advanced accessibility using the API. On the other hand, supervised learning is the most widely used paradigm for tasks involving sentiment polarity and fine-grained emotion detection in short and informal texts, such as Twitter posts. However, supervised learning models are data-hungry and heavily reliant on abundant labeled data, which remains a challenge. This study aims to address this challenge by creating a large-scale real-world dataset of 17.5 million tweets. A distant supervision approach relying on emojis available in tweets is applied to label tweets corresponding to Ekman's six basic emotions. Additionally, we conducted a series of experiments using various conventional machine learning models and deep learning, including transformer-based models, on our dataset to establish baseline results. The experimental results and an extensive ablation analysis on the dataset showed that BiLSTM with FastText and an attention mechanism outperforms other models in both classification tasks, achieving an F1-score of 70.92% for sentiment classification and 54.85% for emotion detection.
computer science, information systems, artificial intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address the issue of supervised learning models' high dependency on large amounts of labeled data in sentiment analysis and emotion detection tasks. Specifically, the goals of the paper are as follows:

1. **Automatically Create Large-Scale Sentiment Datasets**:
   - Utilize emojis on Twitter for distant supervision to automatically label Twitter datasets for sentiment polarity and emotion classification tasks.
2. **Evaluate the Performance of Different Models**:
   - Test various traditional machine learning models and deep learning models (including Transformer-based models) on the newly created dataset to establish benchmark results and explore suitable methods for sentiment polarity and emotion detection for this dataset.
3. **Improve Classifier Performance**:
   - Propose a multi-layer BiLSTM model that combines pre-trained word embedding techniques and attention mechanisms for sentiment polarity and multi-class emotion classification tasks.

### Main Research Questions

- **RQ1**: How to automatically create a large-scale sentiment dataset using emojis on Twitter?
- **RQ2**: How do the amount of training data and class imbalance affect the performance of traditional machine learning algorithms and deep neural networks?
- **RQ3**: To what extent can pre-trained word embedding techniques and attention mechanisms improve performance in sentiment and emotion classification tasks?

### Core Contributions

- Collected and organized a large-scale real-world Twitter dataset, automatically labeled using emojis according to the Ekman model.
- Compared the performance of traditional machine learning algorithms and deep neural networks on sentiment polarity and emotion classification tasks.
- Proposed a multi-layer BiLSTM model that combines pre-trained word embeddings and attention mechanisms for sentiment polarity and multi-class emotion classification.
- Conducted ablation analysis to explore the impact of dataset size, number of classes, and class imbalance on classification performance.

Through these efforts, the paper aims to overcome the issues of data scarcity and limited model generalization ability in existing sentiment analysis and emotion detection tasks.
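
The emoji-based distant supervision described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the emoji-to-emotion table `EMOJI_TO_EMOTION`, the function `label_tweet`, and the rule of discarding tweets whose emojis point to more than one emotion are all assumptions made here for illustration. Only the six label names (Ekman's basic emotions) come from the summary.

```python
from collections import Counter
from typing import Optional, Tuple

# Hypothetical emoji-to-emotion table (an assumption, not the paper's mapping).
# The six label names follow Ekman's basic emotions.
EMOJI_TO_EMOTION = {
    "😂": "joy",
    "😊": "joy",
    "😡": "anger",
    "🤢": "disgust",
    "😱": "fear",
    "😢": "sadness",
    "😲": "surprise",
}


def label_tweet(text: str) -> Optional[Tuple[str, str]]:
    """Return (cleaned_text, emotion), or None when no single label applies.

    Assumed rule: a tweet is kept only if all of its mapped emojis
    agree on exactly one emotion.
    """
    emotions = {EMOJI_TO_EMOTION[ch] for ch in text if ch in EMOJI_TO_EMOTION}
    if len(emotions) != 1:
        return None  # no mapped emoji, or conflicting emotions
    # Remove the label-bearing emojis so a classifier cannot read the label
    # directly from the input, then normalize whitespace.
    cleaned = "".join(ch for ch in text if ch not in EMOJI_TO_EMOTION)
    cleaned = " ".join(cleaned.split())
    return cleaned, emotions.pop()


# Made-up example tweets, not taken from the dataset.
tweets = [
    "I love this 😂😂",
    "Worst service ever 😡",
    "no emoji here",
    "mixed feelings 😂😡",
]
labeled = [pair for pair in map(label_tweet, tweets) if pair is not None]
class_counts = Counter(emotion for _, emotion in labeled)
```

Counting labels, as in `class_counts`, is a quick way to inspect the class imbalance that the ablation analysis examines.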