Abstract: Social media datasets are essential for research on disinformation, influence operations, social sensing, hate speech detection, cyberbullying, and other significant topics. However, access to these datasets is often restricted due to costs and platform regulations. As such, acquiring datasets that span multiple platforms, which are crucial for a comprehensive understanding of the digital ecosystem, is particularly challenging. This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms, aiming to match the quality of real datasets. We employ ChatGPT to generate synthetic data from two real datasets, each consisting of posts from three different social media platforms. We assess the lexical and semantic properties of the synthetic data and compare them with those of the real data. Our empirical findings suggest that using large language models to generate synthetic multi-platform social media data is promising. However, further enhancements are necessary to improve the fidelity of the outputs.

### What problem does this paper attempt to solve?

This paper addresses the difficulty of obtaining multi-platform social media datasets. Specifically, researchers face the following challenges:

1. **High cost of data acquisition**: Obtaining data from multiple social media platforms is expensive.
2. **Platform restrictions**: Many social media platforms impose strict restrictions on data access, making it difficult for researchers to obtain the data they need.
3. **Limited data sharing**: Because of privacy and legal concerns, datasets are usually not shared with the research community, even when the underlying posts are public.
4. **Complexity of multi-platform data**: Obtaining data that covers multiple platforms is more complex because each platform has its own content formats and user behavior patterns.

To address these problems, the paper explores the possibility of using large language models (LLMs), especially GPT models, to generate synthetic datasets across multiple social media platforms. The goals of the research are:

- **Generate high-quality synthetic data**: Ensure that the synthetic data is lexically and semantically similar to real data, so that various studies can be carried out without the need for actual data.
- **Promote the reproducibility of research**: By providing synthetic datasets, enable other researchers to reproduce results without violating laws or platform regulations.
- **Improve transparency**: Enhance the transparency of social media content management, help guard against systemic risks, and support regulations such as the EU Digital Services Act.

### Research methods

The researchers selected two multi-platform datasets for their experiments:

1. **The 2022 US midterm elections**: Posts from Twitter, Facebook, and Reddit.
2. **Dutch social media influencers**: Posts from TikTok, Instagram, and YouTube.

They used ChatGPT to generate synthetic data and evaluated its quality in the following ways:

- **Lexical features**: Compare the number and diversity of hashtags, user tags, URLs, and emojis in the generated data and the real data.
- **Sentiment analysis**: Use a pre-trained sentiment classification model to evaluate the sentiment of the generated content.
- **Topic generation and overlap**: Extract topics with BERTopic and compare the topic distributions of the generated data and the real data.

### Main findings

- **Lexical features**: The synthetic data overuses hashtags on some platforms (such as Twitter and Reddit) and underuses them on others (such as Instagram and TikTok). In addition, generated hashtags and user tags are rarely reused across posts.
- **Sentiment analysis**: The generated content is generally more positive and contains less negative sentiment, which may be because the model intentionally limits the generation of negative content.
- **Topic generation**: The synthetic data reproduces the topics of the real data well, but on some platforms topic diversity and overlap differ.

Overall, the research shows that it is feasible to use large language models to generate synthetic multi-platform social media data, but further improvements are needed to enhance the authenticity and diversity of the generated data.
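
The lexical comparison described under "Research methods" (counts and diversity of hashtags, user tags, and URLs) can be illustrated with a minimal Python sketch. This is not the authors' code: the regular expressions, the function names `extract_features` and `lexical_profile`, the `reuse_rate` definition, and the sample posts are all assumptions made for illustration, and emojis are left out because matching them reliably needs more than a simple pattern.

```python
import re
from collections import Counter

# Hypothetical extraction patterns; the paper's exact rules are not given here.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def extract_features(post):
    """Return the hashtags, user tags, and URLs found in one post."""
    urls = URL_RE.findall(post)
    # Remove URLs first so that '#' or '@' inside a link is not miscounted.
    text = URL_RE.sub(" ", post)
    return {
        "hashtags": [h.lower() for h in HASHTAG_RE.findall(text)],
        "user_tags": [u.lower() for u in MENTION_RE.findall(text)],
        "urls": urls,
    }


def lexical_profile(posts):
    """Summarize each feature: mean count per post, unique values, reuse rate."""
    counters = {name: Counter() for name in ("hashtags", "user_tags", "urls")}
    for post in posts:
        for name, values in extract_features(post).items():
            counters[name].update(values)

    n_posts = len(posts)
    profile = {}
    for name, counter in counters.items():
        total = sum(counter.values())
        unique = len(counter)
        profile[name] = {
            "per_post": total / n_posts if n_posts else 0.0,
            "unique": unique,
            # Assumed definition: share of occurrences that repeat an earlier value.
            "reuse_rate": (total - unique) / total if total else 0.0,
        }
    return profile


if __name__ == "__main__":
    # Made-up posts for illustration only; they are not taken from the paper's data.
    real_posts = [
        "Vote today! #Midterms2022 @voter_info https://example.org/poll",
        "Polls are open #midterms2022 #vote",
    ]
    synthetic_posts = [
        "Excited to vote! #Election #Democracy #VoteNow",
        "Make your voice heard #CivicDuty",
    ]
    for label, posts in (("real", real_posts), ("synthetic", synthetic_posts)):
        print(label, lexical_profile(posts)["hashtags"])
```

Comparing the `per_post` and `reuse_rate` values of the two profiles gives a rough view of the kind of over- or underuse and low reuse reported in the findings above; a fuller comparison would also cover emojis and run per platform.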

Leveraging GPT for the Generation of Multi-Platform Social Media Datasets for Research
