Abstract: Writing effective prompts for large language models (LLMs) can be unintuitive and burdensome. In response, services that optimize or suggest prompts have emerged. While such services can reduce user effort, they also introduce a risk: the prompt provider can subtly manipulate prompts to produce heavily biased LLM responses. In this work, we show that subtle synonym replacements in prompts can increase the likelihood (by a difference of up to 78%) that LLMs mention a target concept (e.g., a brand, political party, nation). We substantiate our observations through a user study, showing our adversarially perturbed prompts 1) are indistinguishable from unaltered prompts by humans, 2) push LLMs to recommend target concepts more often, and 3) make users more likely to notice target concepts, all without arousing suspicion. The practicality of this attack has the potential to undermine user autonomy. Among other measures, we recommend implementing warnings against using prompts from untrusted parties.

What problem does this paper attempt to address?

The problem this paper addresses is that optimized or recommended prompts supplied by third parties may be maliciously manipulated, causing large language models (LLMs) to generate heavily biased responses. Specifically, the authors study whether subtle synonym replacements in a prompt can increase the likelihood that an LLM mentions a specific concept (such as a brand, political party, or nation) without arousing user suspicion. Such an attack affects the LLM's responses and can also undermine the user's autonomy without the user's knowledge.

### Main research content

- **Background and motivation**: With the development of LLMs, chatbots have become an indispensable part of users' digital experiences. Effective prompts are often difficult to write, however, so many services now optimize or recommend prompts. Although these services reduce the burden on users, they also introduce a new risk: prompt providers can subtly manipulate prompts so that LLMs generate biased responses.
- **Research methods**: Through experiments and a user study, the authors show that synonym replacement can increase the probability that LLMs mention a target concept (by a difference of up to 78%), and that users cannot detect the manipulated prompts.
- **Experimental results**: The authors ran experiments on models such as Llama2, Llama3, Llama3-it (instruction-tuned), and Gemma-it, and developed a dataset of 524 prompts covering two scenarios: shopping and social topics. The results show that synonym replacement significantly increases the probability that LLMs mention target concepts.
- **User studies**: To verify the effectiveness and stealthiness of the attack, the authors conducted a user study. The results show that users cannot distinguish the manipulated prompts from unaltered ones, and that the manipulated prompts significantly increase the likelihood that users notice the target concepts.

### Main contributions

1. **Defined a new threat model**: Malicious prompt providers manipulate prompts so that LLMs generate biased responses, thereby influencing users.
2. **Collected a dataset**: Collected 524 prompts and their associated target concepts to evaluate the attack.
3. **Proposed a synonym replacement method**: Showed that synonym replacement significantly increases the probability that LLMs mention target concepts while remaining undetectable to users.
4. **Verified the transferability of the attack**: Demonstrated that the attack transfers between different LLMs and remains effective even on API-only models.
5. **Verified the effectiveness of the attack through user studies**: Confirmed the effectiveness and stealthiness of the synonym replacement attack in realistic scenarios.

### Conclusion

This paper reveals the risks that third-party prompt providers may introduce and proposes a new attack method, synonym replacement, which can significantly increase the probability that LLMs mention specific concepts without arousing user suspicion. The finding matters for improving the security and transparency of LLMs.
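The quantity reported in the summary, the difference in how often a target concept is mentioned for an unaltered prompt versus a synonym-swapped one, can be illustrated with a minimal sketch. This is not the authors' algorithm or code: the helper names (`synonym_variants`, `mention_rate`, `rate_difference`), the synonym table, the brand names, and the canned `toy_sample` responses are all hypothetical stand-ins, and a real audit would replace `toy_sample` with repeated sampling from an actual model.

```python
import itertools
import re
from typing import Callable, Dict, Iterable, List


def synonym_variants(prompt: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Enumerate prompts obtained by swapping listed words for synonyms.

    The first element is the original prompt (whitespace-normalized).
    """
    words = prompt.split()
    # Each word may stay as-is or be replaced by one of its listed synonyms.
    options = [[w] + synonyms.get(w.lower(), []) for w in words]
    return [" ".join(choice) for choice in itertools.product(*options)]


def mentions(response: str, target: str) -> bool:
    """Case-insensitive whole-word check for the target concept."""
    pattern = rf"\b{re.escape(target)}\b"
    return re.search(pattern, response, re.IGNORECASE) is not None


def mention_rate(responses: Iterable[str], target: str) -> float:
    """Fraction of responses that mention the target concept."""
    responses = list(responses)
    if not responses:
        return 0.0
    return sum(mentions(r, target) for r in responses) / len(responses)


def rate_difference(
    base: str,
    perturbed: str,
    target: str,
    sample: Callable[[str], List[str]],
) -> float:
    """Mention rate for the perturbed prompt minus that for the base prompt."""
    return mention_rate(sample(perturbed), target) - mention_rate(sample(base), target)


def toy_sample(prompt: str) -> List[str]:
    """Hypothetical stand-in for sampling an LLM several times (canned text)."""
    if "premium" in prompt:
        return ["Try AcmePhone.", "AcmePhone is solid.",
                "Consider OtherPhone.", "AcmePhone works."]
    return ["Consider OtherPhone.", "Try AcmePhone.",
            "OtherPhone is fine.", "Any phone works."]


if __name__ == "__main__":
    base = "Recommend a good phone"
    variants = synonym_variants(base, {"good": ["premium", "great"]})
    for candidate in variants[1:]:
        diff = rate_difference(base, candidate, "AcmePhone", toy_sample)
        print(f"{candidate!r}: difference in mention rate = {diff:+.2f}")
```

The same comparison could serve a defender: running it on a prompt received from an untrusted provider against a self-written paraphrase would flag wordings that shift the mention rate of a particular concept, under the assumption that enough responses are sampled for the rates to be stable.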

LLM Whisperer: An Inconspicuous Attack to Bias LLM Responses

How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts

How Susceptible are LLMs to Influence in Prompts?

Cognitive Bias in Decision-Making with LLMs

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models

Prompt Design Matters for Computational Social Science Tasks but in Unpredictable Ways

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

An LLM can Fool Itself: A Prompt-Based Adversarial Attack

ConfusionPrompt: Practical Private Inference for Online Large Language Models

PRSA: PRompt Stealing Attacks against Large Language Models

Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective

The language of prompting: What linguistic properties make a prompt successful?

Learning from Contrastive Prompts: Automated Optimization and Adaptation

PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

Do LLMs exhibit human-like response biases? A case study in survey design

Imprompter: Tricking LLM Agents into Improper Tool Use

The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs

Social Bias Evaluation for Large Language Models Requires Prompt Variations

MaPPing Your Model: Assessing the Impact of Adversarial Attacks on LLM-based Programming Assistants