Abstract: Researchers use Twitter and sentiment analysis to predict Cardiovascular Disease (CVD) risk. We developed a new dictionary of CVD-related keywords by analyzing emotions expressed in tweets. Tweets from eighteen US states, including the Appalachian region, were collected. Using the VADER model for sentiment analysis, users were classified as potentially at CVD risk. Machine Learning (ML) models were employed to classify individuals' CVD risk and were also applied to a CDC dataset with demographic information for comparison. Performance was evaluated using Test Accuracy, Precision, Recall, F1 score, Matthews Correlation Coefficient (MCC), and Cohen's Kappa (CK) score. Results demonstrated that analyzing tweets' emotions surpassed the predictive power of demographic data alone, enabling the identification of individuals at potential risk of developing CVD. This research highlights the potential of Natural Language Processing (NLP) and ML techniques in using tweets to identify individuals with CVD risks, providing an alternative approach to traditional demographic information for public health monitoring.
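The evaluation metrics named in the abstract can all be derived from a binary confusion matrix. The following is a minimal sketch using the standard definitions; the function name `binary_metrics` and the example counts are illustrative assumptions and are not taken from the paper.

```python
from math import sqrt


def binary_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion counts.

    tp/fp/fn/tn are hypothetical counts for an "at risk" vs "not at risk" task.
    """
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Matthews correlation coefficient: correlation between predicted and true labels.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
        "kappa": kappa,
    }


# Illustrative counts only (not results from the study).
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
```

MCC and Cohen's Kappa are useful alongside accuracy because both account for agreement that would occur by chance, which matters when the risk and non-risk classes are imbalanced.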

What problem does this paper attempt to address?

The paper attempts to address the problem of predicting cardiovascular disease (CVD) risk by analyzing data from social media. Specifically, the researchers developed a new dictionary to identify keywords related to cardiovascular disease and used sentiment analysis techniques to classify Twitter users to determine which users might be at risk of cardiovascular disease. Additionally, the study compared the performance of prediction models based on Twitter data with those based on demographic information to validate the effectiveness of social media data in predicting cardiovascular disease risk.

### Main Research Objectives:

1. **Develop a new dictionary**: Create a dictionary containing keywords related to cardiovascular disease to extract relevant information from Twitter.
2. **Sentiment analysis**: Use the VADER model to perform sentiment analysis on tweets, categorizing users into potential cardiovascular disease risk groups and non-risk groups.
3. **Machine learning models**: Apply various machine learning and deep learning models (such as CNN-LSTM, BNB, SVM, LR, CatBoost) to predict individuals' cardiovascular disease risk.
4. **Performance evaluation**: Evaluate the performance of the models using metrics such as accuracy, precision, recall, F1 score, Matthews correlation coefficient (MCC), and Cohen's Kappa (CK).
5. **Compare different data sources**: Compare the prediction results based on Twitter data with those based on demographic data from the Centers for Disease Control and Prevention (CDC) to validate the effectiveness of social media data.

### Research Background:

Cardiovascular disease is one of the leading causes of death, especially in the United States. Traditional risk prediction methods mainly rely on demographic information, but this information may not be comprehensive. Social media data (such as Twitter) contains a wealth of information about individuals' psychological states and behaviors, which can be used to supplement traditional data and improve prediction accuracy.

### Research Methods:

1. **Data collection**: Collected 269,969 tweets from 18 states in the United States (including the Appalachian region) spanning from 2019 to 2021.
2. **Data preprocessing**: Preprocessed the Twitter data, including tokenization, stemming, and removal of stop words and punctuation.
3. **Sentiment analysis**: Used the VADER model to perform sentiment analysis on tweets, setting a threshold of -0.30 to distinguish between positive and negative sentiments.
4. **Model training and testing**: Split the dataset into training and testing sets, using various machine learning and deep learning models for training and testing.
5. **Performance evaluation**: Evaluated the prediction performance of the models using multiple performance metrics.

### Research Results:

- **Twitter dataset**: The SVM model had the highest test accuracy at 88.75%, followed by the LR model with an accuracy of 87.82%.
- **CDC dataset**: The LR model had the highest test accuracy at 58.03%, followed by the BNB model with an accuracy of 57.93%.
- **Comparison results**: Overall, models based on Twitter data performed better in predicting cardiovascular disease risk than models based on CDC data.

### Conclusion:

The research results indicate that by analyzing social media data (such as Twitter), combined with natural language processing and machine learning techniques, it is possible to effectively predict individuals' cardiovascular disease risk. This approach is, in some aspects, more effective than traditional methods that rely solely on demographic information.
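The sentiment-thresholding step listed under Research Methods could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes each tweet already has a VADER compound score in [-1, 1], that a user's tweets are aggregated by their mean score, and that a mean strictly below the reported -0.30 threshold marks a user as potentially at risk. The function name `label_users` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Threshold reported in the summary; how it is applied per user is an assumption.
RISK_THRESHOLD = -0.30


def label_users(scored_tweets, threshold=RISK_THRESHOLD):
    """Map each user to True (potentially at risk) or False.

    scored_tweets: iterable of (user_id, compound_score) pairs, where the
    compound score is assumed to come from a VADER sentiment analyzer.
    Aggregating by the per-user mean is an assumption made for this sketch.
    """
    per_user = defaultdict(list)
    for user_id, compound in scored_tweets:
        per_user[user_id].append(compound)
    # A user is flagged when their average sentiment is strictly below the threshold.
    return {user: mean(scores) < threshold for user, scores in per_user.items()}


# Invented sample scores, for illustration only.
sample = [
    ("u1", -0.62),
    ("u1", -0.41),
    ("u2", 0.35),
    ("u2", -0.10),
    ("u3", -0.30),
]
labels = label_users(sample)
```

Keeping the threshold as a parameter makes it easy to check how sensitive the resulting risk labels are to the cut-off before any classifier is trained on them.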

Cardiovascular Disease Risk Prediction via Social Media

Advancements in Cardiovascular Disease Detection: Leveraging Data Mining and Machine Learning

Machine Learning Models for Cardiovascular Disease Prediction: A Comparative Study

A comprehensive study of machine learning for predicting cardiovascular disease using Weka and Statistical Package for Social Sciences tools

Efficient Data-Driven Machine Learning Models for Cardiovascular Diseases Risk Prediction

Identifying the Main Risk Factors for Cardiovascular Diseases Prediction Using Machine Learning Algorithms

Mapping the Heartbeat of America with ChatGPT-4: Unpacking the Interplay of Social Vulnerability, Digital Literacy, and Cardiovascular Mortality in County Residency Choices

Integrated Machine Learning Model for Comprehensive Heart Disease Risk Assessment Based on Multi-Dimensional Health Factors

A Novel Study on Machine Learning Algorithm-Based Cardiovascular Disease Prediction

A Data Mining Approach to Predict Risk of Cardiovascular

Predicting incident cardiovascular disease among African-American adults: A deep learning approach to evaluate social determinants of health in the Jackson heart study

Exploring Predictive Methods for Cardiovascular Disease: A Survey of Methods and Applications

Enhanced Cardiovascular Disease Prediction Modelling using Machine Learning Techniques: A Focus on CardioVitalnet

Enhancing Cardiovascular Disease Risk Prediction with Machine Learning Models

Improving Cardiovascular Disease Prediction With Machine Learning Using Mental Health Data: A Prospective UK Biobank Study

Machine Learning Methods in Real-World Studies of Cardiovascular Disease

Machine Learning Implementations for Multi-class Cardiovascular Risk Prediction in Family Health Units

Multimodal Learning for Cardiovascular Risk Prediction using EHR Data

Machine Learning Models for the Identification of Cardiovascular Diseases Using UK Biobank Data

Multilayer Perceptron Neural Network with Arithmetic Optimization Algorithm-Based Feature Selection for Cardiovascular Disease Prediction

Discovering biomarkers associated and predicting cardiovascular disease with high accuracy using a novel nexus of machine learning techniques for precision medicine