Content analysis of Persian/Farsi Tweets during COVID-19 pandemic in Iran using NLP

Pedram Hosseini, Poorya Hosseini, David A. Broniatowski
DOI: https://doi.org/10.48550/arXiv.2005.08400
2020-05-18
Abstract: Iran, along with China, South Korea, and Italy was among the countries that were hit hard in the first wave of the COVID-19 spread. Twitter is one of the widely-used online platforms by Iranians inside and abroad for sharing their opinion, thoughts, and feelings about a wide range of issues. In this study, using more than 530,000 original tweets in Persian/Farsi on COVID-19, we analyzed the topics discussed among users, who are mainly Iranians, to gauge and track the response to the pandemic and how it evolved over time. We applied a combination of manual annotation of a random sample of tweets and topic modeling tools to classify the contents and frequency of each category of topics. We identified the top 25 topics among which living experience under home quarantine emerged as a major talking point. We additionally categorized broader content of tweets that shows satire, followed by news, is the dominant tweet type among the Iranian users. While this framework and methodology can be used to track public response to ongoing developments related to COVID-19, a generalization of this framework can become a useful framework to gauge Iranian public reaction to ongoing policy measures or events locally and internationally.
Social and Information Networks, Computation and Language, Computers and Society
What problem does this paper attempt to address?
This paper aims to understand how the Iranian public responded to the COVID-19 pandemic, and how that response changed over time, by analyzing Persian/Farsi Twitter content posted during the pandemic in Iran. Specifically, the researchers classify and quantify discussion topics and content types on Twitter using natural language processing techniques, chiefly topic modeling combined with manual annotation. The main objectives of the paper include:

1. **Identify main discussion topics**: By analyzing more than 530,000 Persian-language tweets related to COVID-19, identify the main topics discussed by users. The study found that the experience of home quarantine became a major discussion point.
2. **Classify Twitter content**: In addition to specific discussion topics, the researchers also classified tweets into broader content types, such as satire, news, and opinion. The results show that satire is the most common content type, followed by news.
3. **Track public response**: The researchers hope this framework and method can be used to track the Iranian public's response to the pandemic and related policy measures, and that it can be extended to other major economic, political, or health events.

### Main methods and techniques

1. **Data collection**:
   - Use the Social Feed Manager (SFM) platform to collect Twitter data through the Twitter Developer API.
   - Select Persian-language tweets related to COVID-19 by filtering on specific hashtags.
2. **Pre-processing**:
   - Keep only original tweets and remove replies and quotes.
   - Remove URLs, emojis, punctuation marks, and English numbers.
   - Use the Hazm library to normalize the tweet text.
   - Create a dedicated Persian-language stop-word list.
3. **Topic modeling**:
   - Use the Latent Dirichlet Allocation (LDA) model for topic analysis, with the number of topics set to 50.
   - Generate the LDA model with the Mallet tool and optimize the hyperparameters.
4. **Manual annotation**:
   - Randomly select sample tweets for manual annotation and define multiple categories, such as "satire", "news", and "opinion".
   - Calculate the distribution of the different categories and evaluate the degree of agreement between annotators.

### Main findings

1. **Main discussion topics**:
   - The experience of home quarantine is one of the most important discussion topics.
   - Other popular topics include news reports on the pandemic and criticism of, and suggestions about, government measures.
2. **Twitter content types**:
   - Satire is the most common content type, followed by news.
   - Some tweets also express concern and dissatisfaction about the pandemic, as well as criticism of government measures.
3. **Temporal change trends**:
   - The number of tweets decreased significantly near the Iranian New Year (Nowruz), probably because users reduced their social media activity during the festival.
   - Although the number of confirmed cases continued to increase, the number of tweets about the pandemic decreased, which may reflect declining attention to the pandemic or an underestimation of its severity.

### Future directions

1. **Topic changes within the time window**: Further analyze how discussion topics change across different time windows to better understand the dynamics of public response.
2. **Continue manual annotation**: Annotate more of the newly collected data to improve the understanding and measurement accuracy of public response.
3. **Disinformation analysis**: Analyze in depth the authenticity of the information shared on Twitter, and identify different types of false information and their dissemination strategies.

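
The pre-processing steps listed under "Main methods and techniques" can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' code. The function names `is_original` and `clean_tweet`, the emoji ranges, and the extra Persian punctuation characters are all assumptions made for this sketch, and `is_original` assumes tweet records shaped like Twitter API v1.1 JSON. The Hazm normalization step is only marked by a comment, since it depends on an external library.

```python
import re
import string

# Matches http(s) links and bare www links.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# ASCII ("English") digits only; Persian digits are deliberately left alone.
ASCII_DIGIT_RE = re.compile(r"[0-9]")
# Rough emoji coverage (assumption: these ranges are not exhaustive).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\uFE0F]"
)
# ASCII punctuation plus a few common Persian marks (assumed list).
PUNCT_TABLE = str.maketrans(
    {ch: " " for ch in string.punctuation + "،؛؟«»…"}
)


def is_original(tweet):
    """Return True for tweets that are not replies, quotes, or retweets.

    Assumes a dict with Twitter API v1.1 field names.
    """
    return (
        tweet.get("in_reply_to_status_id") is None
        and not tweet.get("is_quote_status", False)
        and "retweeted_status" not in tweet
    )


def clean_tweet(text, stop_words=frozenset()):
    """Strip URLs, emojis, ASCII digits, punctuation, and stop words."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = ASCII_DIGIT_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    # A Hazm normalization pass would go here in the described pipeline.
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)


if __name__ == "__main__":
    sample = {"text": "#کرونا در خانه بمانید 😷 https://t.co/abc 2020!"}
    if is_original(sample):
        print(clean_tweet(sample["text"], stop_words={"در"}))
```

URLs are removed before punctuation so that link fragments such as `t.co/abc` do not survive as stray tokens.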
Through these methods and techniques, the researchers hope to provide an effective tool for monitoring and understanding the Iranian public's response to the pandemic and to other major events.
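
The manual annotation step reports a category distribution and an inter-annotator agreement score. The summary does not say which agreement statistic was used, so the sketch below uses Cohen's kappa for two annotators as one common choice; this is an assumption, not the paper's stated method. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def category_distribution(labels):
    """Return the share of each category among the annotated tweets."""
    counts = Counter(labels)
    total = len(labels)
    return {category: n / total for category, n in counts.items()}


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the paper.
    annotator_1 = ["satire", "satire", "news", "news"]
    annotator_2 = ["satire", "news", "news", "news"]
    print(category_distribution(annotator_2))
    print(cohen_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one category, such as satire here, dominates the sample.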