Abstract: Large Language Models (LLMs) raise concerns about lowering the cost of generating texts that could be used for unethical or illegal purposes, especially on social media. This paper investigates the promise of such models to help enforce legal requirements related to the disclosure of sponsored content online. We investigate the use of LLMs for generating synthetic Instagram captions with two objectives. The first objective (fidelity) is to produce realistic synthetic datasets. For this, we implement content-level and network-level metrics to assess whether synthetic captions are realistic. The second objective (utility) is to create synthetic data that is useful for sponsored content detection. For this, we evaluate the effectiveness of the generated synthetic data for training classifiers to identify undisclosed advertisements on Instagram. Our investigations show that the objectives of fidelity and utility may conflict and that prompt engineering is a useful but insufficient strategy. Additionally, we find that while individual synthetic posts may appear realistic, collectively they lack diversity, topic connectivity, and realistic user interaction patterns.

What problem does this paper attempt to address?

This paper explores the effectiveness and challenges of using large language models (LLMs), specifically ChatGPT (gpt-3.5-turbo), to generate synthetic Instagram data for sponsored content detection (SCD). It focuses on two objectives:

1. **Fidelity**: whether the synthetic data is realistic enough to stand in for real Instagram posts.
2. **Utility**: whether the synthetic data helps train classifiers to identify undisclosed advertisements.

#### Problem Background

Instagram is one of the main platforms for influencer marketing and therefore a major channel for sponsored content. By law, advertisements must be clearly disclosed to ensure transparency and to protect consumers from misleading or harmful advertising. However, limited API access and high data-labeling costs make the data needed to develop machine-learning solutions scarce. In addition, undisclosed advertisements are inherently difficult to identify and collect.

#### Research Questions

The core research question is: can ChatGPT 3.5 bridge this gap by generating synthetic data that is realistic enough to support sponsored content detection?

#### Main Findings

- **Conflict between Fidelity and Utility**: Fidelity and utility may conflict. Although individual synthetic posts may look realistic, collectively they lack diversity, topic connectivity, and realistic user interaction patterns.
- **The Role of Prompt Engineering**: Prompt engineering is a useful strategy, but it is not sufficient on its own. Other methods, such as post-processing, need to be combined with it to improve the quality of synthetic data.
- **Lack of Diversity and Representativeness**: Synthetic data shows an unrealistic distribution in content diversity, representation, and structural connectivity, indicating that current models still struggle to generate diverse and highly realistic content.

Through these findings, the paper provides guidance for evaluating other LLMs for synthetic data generation and emphasizes the importance of assessing fidelity and utility separately when evaluating synthetic data.
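
The observation that synthetic posts can look realistic one at a time yet lack diversity as a set can be illustrated with a simple collection-level measure. The sketch below computes a distinct-n ratio (unique word n-grams divided by total n-grams) over a set of captions. This is an assumption chosen for illustration, not the content-level metric used in the paper; the function name `distinct_n` and the toy captions are hypothetical.

```python
from collections import Counter


def distinct_n(captions, n=2):
    """Return the ratio of unique word n-grams to total n-grams in a caption set.

    Values near 1.0 mean little repetition across captions; lower values
    mean the set reuses the same phrasing. Returns 0.0 if no n-grams exist.
    """
    counts = Counter()
    for caption in captions:
        tokens = caption.lower().split()  # naive whitespace tokenization
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


# Hypothetical toy captions, invented for illustration only.
repetitive = [
    "loving my new look #ad",
    "loving my new shoes #ad",
    "loving my new bag #ad",
]
varied = [
    "sunrise hike before work today",
    "thanks to the team for this gift #ad",
    "weekend market finds and coffee",
]

print(f"repetitive set: {distinct_n(repetitive):.3f}")
print(f"varied set:     {distinct_n(varied):.3f}")
```

Each caption in the repetitive set reads plausibly in isolation, but the set as a whole scores lower because the same bigrams recur. A measure like this only captures surface repetition; it says nothing about topic connectivity or user interaction patterns, which require network-level analysis.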

InstaSynth: Opportunities and Challenges in Generating Synthetic Instagram Data with ChatGPT for Sponsored Content Detection

Towards Realistic Synthetic User-Generated Content: A Scaffolding Approach to Generating Online Discussions

Leveraging GPT for the Generation of Multi-Platform Social Media Datasets for Research

Closing the Loop: Testing ChatGPT to Generate Model Explanations to Improve Human Labelling of Sponsored Content on Social Media

Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction

Assessing the risks and opportunities posed by AI-enhanced influence operations on social media

Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data

A Synthetic Dataset for Personal Attribute Inference

Using Neural Generative Models to Release Synthetic Twitter Corpora with Reduced Stylometric Identifiability of Users

Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations

A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models

LLMs Among Us: Generative AI Participating in Digital Discourse

A Survey on Detection of LLMs-Generated Content

The Potential and Limitations of Large Language Models for Text Classification through Synthetic Data Generation

LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives

AI "News" Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian

Artificial Data, Real Insights: Evaluating Opportunities and Risks of Expanding the Data Ecosystem with Synthetic Data

Unmasking the Imposters: How Censorship and Domain Adaptation Affect the Detection of Machine-Generated Tweets

ChatGPT and large language models in academia: opportunities and challenges

Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

Formalizing content creation and evaluation methods for AI-generated social media content