Abstract

Objective: Current studies that use social media data for disease monitoring face challenges such as noisy colloquial language and insufficient tracking of individual users' disease progression in longitudinal data. This study aims to develop a pipeline for collecting, cleaning, and analyzing large-scale longitudinal social media data for disease monitoring, with a focus on the COVID-19 pandemic.

Materials and Methods: The pipeline begins by screening tweets posted between February 1, 2020, and April 30, 2022, for self-reported COVID-19 cases. Longitudinal data are then collected for each patient, covering two months before and three months after the self-report. Symptoms are extracted using Named Entity Recognition (NER) and then denoised with a model combining a Graph Convolutional Network (GCN) and Bidirectional Encoder Representations from Transformers (BERT), so that only User Symptom Mentions (USM) are retained. Subsequently, symptoms are mapped to standardized medical concepts using the Unified Medical Language System (UMLS). Finally, symptom pattern analysis and visualization illustrate temporal changes in symptom prevalence and co-occurrence.

Results: This study identified 191,096 self-reported COVID-19-positive cases from COVID-19-related tweets and retrospectively collected 811,398,280 historical tweets, of which 2,120,964 contained symptom information. After denoising, 39% (832,287) of symptom-sharing tweets reflected user-related mentions. The trained USM model achieved an F1 score of 0.926. Further analysis revealed a higher prevalence of upper respiratory tract symptoms during the Omicron period than during the Delta and wild-type periods. Additionally, lower respiratory tract and nervous system symptoms co-occurred markedly during the wild-type and Delta periods.

Conclusion: This study established a robust framework for pandemic monitoring via social media, integrating denoising of user-related symptom mentions and longitudinal data.
The findings underscore the importance of denoising procedures in revealing accurate prevalence trends, thereby minimizing biases in symptom analysis.

Keywords: Natural language processing, deep learning, social media, public health, COVID-19, symptom surveillance

### Competing Interest Statement

The authors have declared no competing interest.

### Funding Statement

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

### Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below: The Ethics Committee of the School of Public Health, Zhejiang University, gave ethical approval for this work (approval number: ZGL202201-2).

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes

All data produced in the present study are available upon reasonable request to the authors.

Denoising Longitudinal Social Media for Pandemic Monitoring
