Cardiovascular Disease Risk Prediction via Social Media

Al Zadid Sultan Bin Habib, Md Asif Bin Syed, Md Tanvirul Islam, Donald A. Adjeroh
2023-09-29
Abstract: Researchers use Twitter and sentiment analysis to predict Cardiovascular Disease (CVD) risk. We developed a new dictionary of CVD-related keywords by analyzing emotions expressed in tweets. Tweets from eighteen US states, including the Appalachian region, were collected. Using the VADER model for sentiment analysis, users were classified as potentially at CVD risk. Machine Learning (ML) models were employed to classify individuals' CVD risk and applied to a CDC dataset with demographic information to make the comparison. Performance evaluation metrics such as Test Accuracy, Precision, Recall, F1 score, Matthews Correlation Coefficient (MCC), and Cohen's Kappa (CK) score were considered. Results demonstrated that analyzing tweets' emotions surpassed the predictive power of demographic data alone, enabling the identification of individuals at potential risk of developing CVD. This research highlights the potential of Natural Language Processing (NLP) and ML techniques in using tweets to identify individuals with CVD risks, providing an alternative approach to traditional demographic information for public health monitoring.
Computation and Language
What problem does this paper attempt to address?
The paper attempts to address the problem of predicting cardiovascular disease (CVD) risk by analyzing data from social media. Specifically, the researchers developed a new dictionary to identify keywords related to cardiovascular disease and used sentiment analysis techniques to classify Twitter users to determine which users might be at risk of cardiovascular disease. Additionally, the study compared the performance of prediction models based on Twitter data with those based on demographic information to validate the effectiveness of social media data in predicting cardiovascular disease risk.

### Main Research Objectives:

1. **Develop a new dictionary**: Create a dictionary containing keywords related to cardiovascular disease to extract relevant information from Twitter.
2. **Sentiment analysis**: Use the VADER model to perform sentiment analysis on tweets, categorizing users into potential cardiovascular disease risk groups and non-risk groups.
3. **Machine learning models**: Apply various machine learning and deep learning models (such as CNN-LSTM, BNB, SVM, LR, CatBoost) to predict individuals' cardiovascular disease risk.
4. **Performance evaluation**: Evaluate the performance of the models using metrics such as accuracy, precision, recall, F1 score, Matthews correlation coefficient (MCC), and Cohen's Kappa (CK).
5. **Compare different data sources**: Compare the prediction results based on Twitter data with those based on demographic data from the Centers for Disease Control and Prevention (CDC) to validate the effectiveness of social media data.

### Research Background:

Cardiovascular disease is one of the leading causes of death, especially in the United States. Traditional risk prediction methods mainly rely on demographic information, but this information may not be comprehensive.
Social media data (such as Twitter) contains a wealth of information about individuals' psychological states and behaviors, which can be used to supplement traditional data and improve prediction accuracy.

### Research Methods:

1. **Data collection**: Collected 269,969 tweets from 18 states in the United States (including the Appalachian region) spanning from 2019 to 2021.
2. **Data preprocessing**: Preprocessed the Twitter data, including tokenization, stemming, and removal of stop words and punctuation.
3. **Sentiment analysis**: Used the VADER model to perform sentiment analysis on tweets, setting a threshold of -0.30 to distinguish between positive and negative sentiments.
4. **Model training and testing**: Split the dataset into training and testing sets, using various machine learning and deep learning models for training and testing.
5. **Performance evaluation**: Evaluated the prediction performance of the models using multiple performance metrics.

### Research Results:

- **Twitter dataset**: The SVM model had the highest test accuracy at 88.75%, followed by the LR model with an accuracy of 87.82%.
- **CDC dataset**: The LR model had the highest test accuracy at 58.03%, followed by the BNB model with an accuracy of 57.93%.
- **Comparison results**: Overall, models based on Twitter data performed better in predicting cardiovascular disease risk than models based on CDC data.

### Conclusion:

The research results indicate that by analyzing social media data (such as Twitter), combined with natural language processing and machine learning techniques, it is possible to effectively predict individuals' cardiovascular disease risk. This approach is, in some aspects, more effective than traditional methods that rely solely on demographic information.
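
### Illustrative Sketch:

The summary mentions a sentiment threshold of -0.30 and two less familiar metrics, MCC and Cohen's Kappa. The following is a minimal sketch of how those pieces might fit together, not the authors' implementation. It assumes each item already has a VADER-style compound score in [-1, 1], and it assumes that scores at or below the threshold count as negative (the summary does not state the direction or inclusiveness of the cutoff). The function names `label_by_compound`, `confusion_counts`, `mcc`, and `cohen_kappa` are hypothetical helpers introduced only for this example. The metric formulas are the standard binary-classification definitions.

```python
from math import sqrt


def label_by_compound(scores, threshold=-0.30):
    """Label each compound score 1 (negative sentiment) or 0 (otherwise).

    Assumption: a score at or below the threshold is treated as negative.
    """
    return [1 if s <= threshold else 0 for s in scores]


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohen_kappa(tp, tn, fp, fn):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of true and predicted labels.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


if __name__ == "__main__":
    # Made-up compound scores and reference labels, for illustration only.
    scores = [-0.62, -0.30, 0.10, 0.45, -0.41, 0.00]
    reference = [1, 1, 0, 0, 0, 1]
    predicted = label_by_compound(scores)
    counts = confusion_counts(reference, predicted)
    print("MCC:", mcc(*counts))
    print("Cohen's kappa:", cohen_kappa(*counts))
```

Both MCC and Cohen's Kappa account for agreement that would occur by chance, which is why they are often reported alongside accuracy when classes may be imbalanced.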