COVIDHealth: A Benchmark Twitter Dataset and Machine Learning based Web Application for Classifying COVID-19 Discussions

Mahathir Mohammad Bishal, Md. Rakibul Hassan Chowdory, Anik Das, Muhammad Ashad Kabir
2024-02-15
Abstract: The COVID-19 pandemic has had adverse effects on both physical and mental health. During this pandemic, numerous studies have focused on gaining insights into health-related perspectives from social media. In this study, our primary objective is to develop a machine learning-based web application for automatically classifying COVID-19-related discussions on social media. To achieve this, we label COVID-19-related Twitter data, provide benchmark classification results, and develop a web application. We collected data using the Twitter API and labeled a total of 6,667 tweets into five different classes: health risks, prevention, symptoms, transmission, and treatment. We extracted features using various feature extraction methods and applied them to seven different traditional machine learning algorithms: Decision Tree, Random Forest, Stochastic Gradient Descent, AdaBoost, K-Nearest Neighbour, Logistic Regression, and Linear SVC. Additionally, we used four deep learning algorithms for classification: LSTM, CNN, RNN, and BERT. Overall, the highest F1 score, 90.43%, was achieved by the CNN among the deep learning algorithms. Among the traditional machine learning approaches, Linear SVC achieved the highest F1 score, 86.13%. Our study not only contributes to the field of health-related data analysis but also provides a valuable resource in the form of a web-based tool for efficient data classification, which can aid in addressing public health challenges and increasing awareness during pandemics. We have made the dataset and application publicly available.
Machine Learning, Social and Information Networks
What problem does this paper attempt to address?
This paper addresses the automatic classification of COVID-19-related discussions on social media platforms. Specifically, the researchers aim to develop a machine learning-based web application that can automatically classify discussions about COVID-19 on Twitter. To achieve this goal, they first collected Twitter data related to COVID-19 and labeled it into five categories: health risks, preventive measures, symptoms, transmission methods, and treatment. By using multiple feature extraction methods and applying different machine learning and deep learning algorithms, the researchers not only provided benchmark results for this classification task but also developed a web application to demonstrate its practical value.

### Main Contributions

1. **New Dataset**: Introduced a new Twitter dataset related to COVID-19, containing 6,667 tweets divided into five key categories: health risks, prevention, symptoms, transmission, and treatment.
2. **Benchmark Classification Performance**: Conducted a comprehensive empirical study on the dataset with traditional machine learning and deep learning methods, providing benchmark classification performance.
3. **Practical Application**: Developed a prototype web application based on a Chrome extension, using the best-performing model, to demonstrate its practical value.

### Method Overview

1. **Data Collection**: Obtained tweet IDs from public datasets and retrieved the corresponding tweet texts using the Twitter API.
2. **Data Labeling**: Two independent labelers labeled the collected tweets, and a third expert then reviewed the labels to ensure their accuracy and consistency.
3. **Pre-processing and Feature Extraction**: Pre-processed the original text data, including removing irrelevant elements (such as mentions, hashtags, and URLs), and used three different feature extraction methods (TF-IDF, LIWC, and POS tagging).
4. **Classification**: Fed the extracted features into multiple traditional machine learning and deep learning algorithms for classification and evaluated the performance of each algorithm.
5. **Performance Evaluation and Application Development**: Conducted a comprehensive analysis of the performance of the classifiers and developed a prototype web application based on the best model.

### Experimental Results

- **Deep Learning Algorithms**: Among the deep learning algorithms, CNN achieved the highest F1 score, 90.43%.
- **Traditional Machine Learning Algorithms**: Among the traditional machine learning algorithms, Linear SVC achieved the highest F1 score, 86.13%.

### Conclusion

This research not only provides a valuable resource for the analysis of health-related data but also demonstrates, through a practical web application, how these classification results can be used to address public health challenges and raise public awareness. The dataset and application have been publicly released for use by other researchers.
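### Illustrative Sketch

The pre-processing and evaluation steps described in the Method Overview can be sketched roughly as follows. This is a minimal illustration, not the authors' code. The regular expressions, the function names `clean_tweet` and `macro_f1`, the choice to drop hashtags entirely, and the use of macro averaging for the F1 score are all assumptions, since the summary does not specify them.

```python
import re

# Assumed patterns: the paper's exact cleaning rules are not given here.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

# The five classes named in the abstract.
LABELS = ["health risks", "prevention", "symptoms", "transmission", "treatment"]


def clean_tweet(text: str) -> str:
    """Strip URLs, mentions, and hashtags, then lowercase and collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.lower().split())


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support or predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, `clean_tweet("Wash your hands! #StaySafe @WHO https://t.co/abc123")` returns `"wash your hands!"`. The cleaned text would then go to a feature extractor such as TF-IDF, and the predictions of each classifier could be compared using `macro_f1`.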