Towards a Social Media-based Disease Surveillance System for Early Detection of Influenza-like Illnesses: A Twitter Case Study in Wales

Mark Drakesmith, Dimosthenis Antypas, Claire Brown, Jose Camacho-Collados, Jiao Song
DOI: https://doi.org/10.1101/2024.11.11.24316812
2024-11-11
Abstract: Social media offers the potential to provide detection of outbreaks or public health incidents faster than traditional reporting mechanisms. In this paper, we developed and tested a pipeline to produce alerts of influenza-like illness (ILI) using Twitter data. Data was collected from the Twitter API, querying keywords referring to ILI symptoms and geolocated to Wales. Tweets that described first-hand descriptions of symptoms (as opposed to non-personal descriptions) were classified using transformer-based language models specialised on social media (BERTweet and TimeLMs), which were trained on a manually labelled dataset matching the above criteria. After gathering this data, weekly tweet counts were applied to the regression-based Noufaily algorithm to identify exceedances throughout 2022. The algorithm was also applied to counts of ILI-related GP consultations for comparison. Exceedance detection applied to the classified tweet counts produced alerts starting four weeks earlier than by using GP consultation data. These results demonstrate the potential to facilitate advanced preparedness for unexpected increases in healthcare burdens.
Epidemiology
What problem does this paper attempt to address?
This paper asks whether social media data, specifically Twitter data, can detect outbreaks of influenza-like illness (ILI) earlier than traditional clinical data. The researchers built a pipeline that collects tweets mentioning ILI symptoms and classifies them with transformer-based language models specialised on social media (BERTweet and TimeLMs) to identify first-hand descriptions of symptoms. The regression-based Noufaily algorithm was then applied to the weekly counts of classified tweets to identify exceedances throughout 2022. Alerts based on Twitter data started four weeks earlier than alerts based on GP consultation data, which could give health services more time to prepare for an increase in healthcare burden.

### Research Background

Traditional syndromic surveillance of infectious disease relies on clinical data such as general practitioner (GP) consultation records, ambulance calls, absence from work due to illness, and the use of telephone consultation services. These sources can lag behind events because of delays in the reporting mechanism. In recent years there has been increasing interest in using social media data to build early-warning systems for infectious diseases. Social media is a fast, high-volume data source with the potential to detect outbreaks or public health incidents more quickly than traditional reporting mechanisms.

### Methods

1. **Data Collection**: Tweets geolocated to Wales were collected from January 2020 to January 2023 using the Twitter Academic API. Only English and Welsh tweets were collected.
2. **Keyword Matching**: 22 keywords related to ILI symptoms were used to filter tweets.
3. **Classification**: NLP models (BERTweet and Twitter-RoBERTa) were trained on manually labelled data to classify tweets as first-hand symptom descriptions or not.
4. **Anomaly Detection**: The Noufaily algorithm was applied to the weekly counts of classified tweets, and the resulting alerts were compared with those obtained from GP consultation data.

### Results

- **Classification Results**: BERTweet performed slightly better than Twitter-RoBERTa in the classification task, especially in identifying first-hand symptom descriptions.
- **Anomaly Detection**: Using the tweet counts classified by BERTweet, the algorithm raised alerts in weeks 44-46 and 50-52 of 2022, whereas alerts in the GP consultation data appeared in week 48 and later. The Twitter-based alerts therefore started about four weeks earlier than those from the traditional data source.

### Discussion

- **Practicality**: The study suggests that social media data can supplement existing surveillance and help public health departments detect the start of the flu season, or an unusually severe epidemic, earlier.
- **Limitations**: The limitations include the scarcity of geolocation data, language differences among social media users, and API access restrictions. In addition, the willingness of social media users to share personal health information is decreasing, which may affect the reliability of the system.

### Conclusion

The study indicates that a social media-based syndromic surveillance system can provide early warning of healthcare burden ahead of traditional indicators, and that it has practical value. However, data access, digital representativeness, and changes in social media usage patterns remain open issues that need further verification.
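
To illustrate the final step of the pipeline described under Methods (aggregating classified tweets into weekly counts and flagging exceedances), here is a minimal Python sketch. It is not the Noufaily algorithm used in the paper: that method is regression-based, whereas this sketch substitutes a much simpler rule that flags a week when its count exceeds the mean plus `z` standard deviations of the preceding weeks. The function names, the `baseline_weeks` and `z` parameters, and the sample values are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from datetime import date
from statistics import mean, stdev


def weekly_counts(dates):
    """Count items per (ISO year, ISO week), sorted chronologically.

    `dates` stands in for the posting dates of tweets already classified
    as first-hand symptom descriptions (hypothetical input).
    """
    counts = Counter(d.isocalendar()[:2] for d in dates)
    return dict(sorted(counts.items()))


def weekly_exceedances(counts, baseline_weeks=8, z=2.0):
    """Return indices of weeks whose count exceeds a simple threshold.

    Simplified stand-in for a regression-based exceedance algorithm:
    the threshold for week i is mean + z * sd of the `baseline_weeks`
    counts immediately before it. Assumed rule, not the paper's method.
    """
    alerts = []
    for i in range(baseline_weeks, len(counts)):
        baseline = counts[i - baseline_weeks:i]
        threshold = mean(baseline) + z * stdev(baseline)
        if counts[i] > threshold:
            alerts.append(i)
    return alerts


if __name__ == "__main__":
    # Made-up weekly counts: a stable baseline followed by a jump.
    series = [10, 12, 11, 9, 10, 11, 12, 10, 11, 30]
    print(weekly_exceedances(series))
```

A production system would need to account for seasonality, trend, and past outbreaks in the baseline, which is what a regression-based approach is designed to handle and this sketch is not.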