BillionCOV: An enriched billion-scale collection of COVID-19 tweets for efficient hydration

Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera
DOI: https://doi.org/10.1016/j.dib.2023.109229
IF: 1.2
2023-06-01
Data in Brief
Abstract: The COVID-19 pandemic has introduced new norms, such as social distancing, face masks, quarantine, lockdowns, travel restrictions, work/study from home, and business closures, to name a few. The pandemic's seriousness has made people vocal on social media, especially on microblogs such as Twitter. Since the early days of the outbreak, researchers have been collecting and sharing large-scale datasets of COVID-19 tweets. However, the existing datasets carry issues related to <i>proportion</i> and <i>redundancy</i>. We report that more than 500 million tweet identifiers point to deleted or protected tweets. To address these issues, this paper introduces an enriched global billion-scale English-language COVID-19 tweets dataset, <i>BillionCOV</i>, which contains 1.4 billion tweets originating from 240 countries and territories between October 2019 and April 2022. Importantly, <i>BillionCOV</i> enables researchers to filter tweet identifiers for efficient hydration. We anticipate that a dataset of this scale, with global scope and extended temporal coverage, will aid in obtaining a thorough understanding of the pandemic's conversational dynamics.