CML-COVID: A Large-Scale COVID-19 Twitter Dataset with Latent Topics, Sentiment and Location Information

Hassan Dashtian, Dhiraj Murthy
DOI: https://doi.org/10.48550/arXiv.2101.12202
2021-01-29
Abstract: As a platform, Twitter has been a significant public space for discussion related to the COVID-19 pandemic. Public social media platforms such as Twitter represent important sites of engagement regarding the pandemic and these data can be used by research teams for social, health, and other research. Understanding public opinion about COVID-19 and how information diffuses in social media is important for governments and research institutions. Twitter is a ubiquitous public platform and, as such, has tremendous utility for understanding public perceptions, behavior, and attitudes related to COVID-19. In this research, we present CML-COVID, a COVID-19 Twitter data set of 19,298,967 tweets from 5,977,653 unique individuals, and summarize some of the attributes of these data. These tweets were collected between March 2020 and July 2020 using the COVID-19-related query terms "coronavirus", "covid", and "mask". We use topic modeling, sentiment analysis, and descriptive statistics to describe the tweets related to COVID-19 we collected and the geographical location of tweets, where available. We provide information on how to access our tweet dataset (archived using twarc).
Social and Information Networks, Computers and Society, Human-Computer Interaction
What problem does this paper attempt to address?
This paper addresses how COVID-19-related social media data can be used for research. It analyzes a large-scale Twitter data set about COVID-19 to understand public opinion, information dissemination patterns, and the geographical distribution of the discussion. Specifically, the paper pursues four aims:

1. **Understanding public opinion**: Study the public's attitudes, perceptions, and behavioral responses to COVID-19 on Twitter. This includes identifying the sentiment (positive, negative, or neutral) of pandemic-related tweets and how that sentiment changes over time.
2. **Information dissemination patterns**: Explore how COVID-19-related information spreads on social media, with particular attention to how the authenticity of information affects public behavior and beliefs. This involves examining the speed and reach of information diffusion and its potential impact on society.
3. **Geographical location analysis**: Analyze how Twitter discussion of COVID-19 is distributed across regions of the world, including differences between countries and cities. This helps show how attention and responses to the pandemic vary by region.
4. **Construction and sharing of the data set**: Build a large-scale data set of 19,298,967 tweets and document the data collection and pre-processing steps so that other researchers can reproduce the analysis or extend it. The paper also explains how to access and use the data, including how to retrieve complete tweets from their tweet IDs.

In summary, by constructing and analyzing a large-scale Twitter data set, this paper aims to give governments and research institutions insight into public attitudes towards COVID-19, information dissemination patterns, and the geographical distribution of the discussion, in support of public health policy and social science research.
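Retrieving complete tweets from tweet IDs usually starts with cleaning the ID list and splitting it into request-sized batches. The sketch below illustrates only that preparation step in Python. It is not the authors' code: the function name `batch_tweet_ids` and the default batch size of 100 are assumptions chosen for illustration, and the actual lookup against Twitter (for example with a tool such as twarc, which the abstract says was used for archiving) is deliberately left out.

```python
from typing import Iterable, Iterator, List


def batch_tweet_ids(lines: Iterable[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` cleaned, de-duplicated tweet IDs.

    `batch_size=100` is an assumed default, not a value taken from the paper.
    """
    seen = set()
    batch: List[str] = []
    for line in lines:
        tweet_id = line.strip()
        # Skip blank lines, non-numeric entries, and IDs already emitted.
        if not tweet_id.isdigit() or tweet_id in seen:
            continue
        seen.add(tweet_id)
        batch.append(tweet_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, possibly shorter, batch.
    if batch:
        yield batch
```

Because the function accepts any iterable of strings, an open text file with one ID per line can be passed directly, and each yielded batch can then be handed to whichever hydration tool is in use.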