InstaSynth: Opportunities and Challenges in Generating Synthetic Instagram Data with ChatGPT for Sponsored Content Detection

Thales Bertaglia, Lily Heisig, Rishabh Kaushal, Adriana Iamnitchi
2024-03-22
Abstract: Large Language Models (LLMs) raise concerns about lowering the cost of generating texts that could be used for unethical or illegal purposes, especially on social media. This paper investigates the promise of such models to help enforce legal requirements related to the disclosure of sponsored content online. We investigate the use of LLMs for generating synthetic Instagram captions with two objectives: The first objective (fidelity) is to produce realistic synthetic datasets. For this, we implement content-level and network-level metrics to assess whether synthetic captions are realistic. The second objective (utility) is to create synthetic data that is useful for sponsored content detection. For this, we evaluate the effectiveness of the generated synthetic data for training classifiers to identify undisclosed advertisements on Instagram. Our investigations show that the objectives of fidelity and utility may conflict and that prompt engineering is a useful but insufficient strategy. Additionally, we find that while individual synthetic posts may appear realistic, collectively they lack diversity, topic connectivity, and realistic user interaction patterns.
Computers and Society, Computation and Language, Social and Information Networks
What problem does this paper attempt to address?
This paper explores the effectiveness and challenges of using large language models (LLMs), specifically ChatGPT (gpt-3.5-turbo), to generate synthetic Instagram data for sponsored content detection (SCD). It focuses on two objectives:

1. **Fidelity**: Whether the generated synthetic data is realistic enough to pass for real Instagram posts.
2. **Utility**: Whether the generated synthetic data helps train classifiers that identify undisclosed advertisements.

#### Problem Background

Instagram is one of the main platforms for influencer marketing and therefore a major channel for sponsored content. Advertisements are legally required to be clearly disclosed, both to ensure transparency and to protect consumers from being misled or affected by harmful advertising. However, limited API access and high data-labeling costs make the data needed to develop machine-learning solutions scarce. In addition, undisclosed advertisements are inherently difficult to identify and collect.

#### Research Questions

The core research question of the paper is: Can ChatGPT 3.5 bridge this gap by generating synthetic data that is realistic enough to be effective for sponsored content detection?

#### Main Findings

- **Conflict between Fidelity and Utility**: The two objectives may conflict. Although individual synthetic posts may look realistic, collectively they lack diversity, topic connectivity, and realistic user interaction patterns.
- **The Role of Prompt Engineering**: Prompt engineering is a useful strategy, but it is not sufficient on its own. Other methods, such as post-processing, need to be combined with it to improve the quality of synthetic data.
- **Lack of Diversity and Representativeness**: Synthetic data shows unrealistic distributions of content diversity, representation, and structural connectivity, indicating that current models still struggle to generate diverse, highly realistic content.

Through these findings, the paper provides guidance for evaluating other LLMs for synthetic data generation and emphasizes the importance of assessing fidelity and utility separately.
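
The idea of collection-level fidelity checks can be illustrated with a small sketch. The code below is not the paper's implementation: the function names, the distinct-n measure, and the hashtag co-occurrence graph are assumptions chosen for illustration, and the captions are invented toy data. It computes one content-level signal (the share of unique bigrams across all captions) and one network-level signal (the fraction of hashtags that fall in the largest connected component of a co-occurrence graph).

```python
import re
from collections import defaultdict
from itertools import combinations

HASHTAG = re.compile(r"#(\w+)")


def distinct_n(captions, n=2):
    """Share of unique n-grams among all n-grams (higher means more diverse)."""
    ngrams = []
    for caption in captions:
        tokens = caption.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def hashtag_graph(captions):
    """Undirected graph linking hashtags that appear in the same caption."""
    graph = defaultdict(set)
    for caption in captions:
        tags = sorted({t.lower() for t in HASHTAG.findall(caption)})
        for tag in tags:
            graph[tag]  # register the hashtag even if it has no neighbours
        for a, b in combinations(tags, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def largest_component_share(graph):
    """Fraction of hashtags that belong to the largest connected component."""
    seen, best = set(), 0
    for start in graph:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:  # iterative depth-first search over one component
            node = stack.pop()
            size += 1
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best / len(graph) if graph else 0.0


# Invented toy captions, for illustration only.
real = [
    "Morning run done #fitness #running",
    "Post-run smoothie #running #healthyfood",
    "New recipe tonight #healthyfood #cooking",
]
synthetic = [
    "Loving my new look #fashion #style",
    "Loving my new shoes #sneakers #kicks",
    "Loving my new bag #travel #adventure",
]

for name, captions in (("real", real), ("synthetic", synthetic)):
    diversity = distinct_n(captions)
    connectivity = largest_component_share(hashtag_graph(captions))
    print(f"{name}: distinct-2={diversity:.2f}, largest component={connectivity:.2f}")
```

In this toy data, every synthetic caption looks plausible in isolation, yet the set reuses the same opening phrase and its hashtags never link across posts. That is the kind of gap a per-post inspection would miss and a collection-level metric can surface.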