Leveraging GPT for the Generation of Multi-Platform Social Media Datasets for Research

Henry Tari, Danial Khan, Justus Rutten, Darian Othman, Rishabh Kaushal, Thales Bertaglia, Adriana Iamnitchi
2024-07-11
Abstract: Social media datasets are essential for research on disinformation, influence operations, social sensing, hate speech detection, cyberbullying, and other significant topics. However, access to these datasets is often restricted due to costs and platform regulations. As such, acquiring datasets that span multiple platforms, which are crucial for a comprehensive understanding of the digital ecosystem, is particularly challenging. This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms, aiming to match the quality of real datasets. We employ ChatGPT to generate synthetic data from two real datasets, each consisting of posts from three different social media platforms. We assess the lexical and semantic properties of the synthetic data and compare them with those of the real data. Our empirical findings suggest that using large language models to generate synthetic multi-platform social media data is promising. However, further enhancements are necessary to improve the fidelity of the outputs.
Computers and Society
### What problem does this paper attempt to solve?

This paper addresses the difficulty of obtaining multi-platform social media datasets. Specifically, researchers face the following challenges:

1. **High cost of data acquisition**: Obtaining data from multiple social media platforms is expensive.
2. **Platform restrictions**: Many social media platforms impose strict limits on data access, making it difficult for researchers to obtain the data they need.
3. **Limited data sharing**: Because of privacy and legal concerns, datasets are usually not shared with the research community, even when the underlying posts are public.
4. **Complexity of multi-platform data**: Collecting data that covers multiple platforms is harder still, because each platform has its own content formats and user behavior patterns.

To address these problems, the paper explores using large language models (LLMs), specifically GPT models, to generate synthetic datasets across multiple social media platforms. The goals of the research are:

- **Generate high-quality synthetic data**: Ensure that the synthetic data resembles real data lexically and semantically, so that studies can be carried out without access to the actual data.
- **Promote the reproducibility of research**: By providing synthetic datasets, enable other researchers to reproduce results without violating laws or platform regulations.
- **Improve transparency**: Increase the transparency of social media content management, help protect against systemic risks, and support regulations such as the EU Digital Services Act.

### Research methods

The researchers selected two multi-platform datasets for their experiments:

1. **The 2022 US midterm elections**: Posts from Twitter, Facebook, and Reddit.
2. **Dutch social media influencers**: Posts from TikTok, Instagram, and YouTube.

They used ChatGPT to generate synthetic data and evaluated its quality in the following ways:

- **Lexical features**: Compare the number and diversity of hashtags, user tags, URLs, and emojis in the generated data and the real data.
- **Sentiment analysis**: Use a pre-trained sentiment classification model to evaluate the sentiment of the generated content.
- **Topic generation and overlap**: Extract topics with BERTopic and compare the topic distributions of the generated data and the real data.

### Main findings

- **Lexical features**: The synthetic data overuses hashtags on some platforms (such as Twitter and Reddit) and underuses them on others (such as Instagram and TikTok). In addition, generated hashtags and user tags are rarely reused across posts.
- **Sentiment analysis**: The generated content is generally more positive and contains less negative sentiment than the real data, possibly because the model deliberately limits the generation of negative content.
- **Topic generation**: The synthetic data reproduces the topics of the real data well, but on some platforms topic diversity and overlap differ.

Overall, the research shows that generating multi-platform synthetic social media data with large language models is feasible, but further improvements are needed to increase the authenticity and diversity of the generated data.
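
The lexical-feature comparison described under Research methods can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `lexical_profile`, the regular expressions, and the toy posts are all assumptions made for illustration, and the emoji pattern covers only a rough subset of Unicode emoji ranges. For each feature, it reports the total count, the number of distinct values, and the average count per post, which is enough to spot overuse, underuse, and low reuse.

```python
import re
from collections import Counter

# Illustrative patterns (assumptions, not taken from the paper).
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
# Rough approximation of emoji code points; not a complete definition.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def lexical_profile(posts):
    """Summarize hashtags, user tags, URLs, and emojis in a list of posts."""
    counts = {name: Counter() for name in ("hashtags", "mentions", "urls", "emojis")}
    for post in posts:
        counts["urls"].update(URL.findall(post))
        # Remove URLs first so fragments like "/#section" are not counted as hashtags.
        text = URL.sub(" ", post)
        counts["hashtags"].update(h.lower() for h in HASHTAG.findall(text))
        counts["mentions"].update(m.lower() for m in MENTION.findall(text))
        counts["emojis"].update(EMOJI.findall(text))
    n_posts = max(len(posts), 1)  # avoid division by zero on empty input
    return {
        name: {
            "total": sum(counter.values()),
            "unique": len(counter),
            "per_post": sum(counter.values()) / n_posts,
        }
        for name, counter in counts.items()
    }


# Toy posts invented for this example.
real_posts = [
    "Vote today! #Midterms2022 @user1 https://example.com/a",
    "#midterms2022 results are in 😀",
]
synthetic_posts = [
    "Make your voice heard! #Vote #Election #Democracy",
    "Big night ahead #ElectionNight #Results",
]

real_profile = lexical_profile(real_posts)
synthetic_profile = lexical_profile(synthetic_posts)
for feature in real_profile:
    print(feature, real_profile[feature], synthetic_profile[feature])
```

A `unique` value close to `total` indicates that values are rarely reused, which is the pattern the findings report for generated hashtags and user tags.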