LLM Whisperer: An Inconspicuous Attack to Bias LLM Responses

Weiran Lin, Anna Gerchanovsky, Omer Akgul, Lujo Bauer, Matt Fredrikson, Zifan Wang
2024-09-16
Abstract: Writing effective prompts for large language models (LLMs) can be unintuitive and burdensome. In response, services that optimize or suggest prompts have emerged. While such services can reduce user effort, they also introduce a risk: the prompt provider can subtly manipulate prompts to produce heavily biased LLM responses. In this work, we show that subtle synonym replacements in prompts can increase the likelihood (by a difference of up to 78%) that LLMs mention a target concept (e.g., a brand, political party, nation). We substantiate our observations through a user study, showing our adversarially perturbed prompts 1) are indistinguishable from unaltered prompts by humans, 2) push LLMs to recommend target concepts more often, and 3) make users more likely to notice target concepts, all without arousing suspicion. The practicality of this attack has the potential to undermine user autonomy. Among other measures, we recommend implementing warnings against using prompts from untrusted parties.
Cryptography and Security, Artificial Intelligence, Human-Computer Interaction, Machine Learning
What problem does this paper attempt to address?
This paper addresses the risk that optimized or recommended prompts supplied by third parties may be maliciously manipulated, causing large language models (LLMs) to generate heavily biased responses. Specifically, the authors study whether subtle synonym replacements in a prompt can increase the likelihood that an LLM mentions a specific concept (such as a brand, political party, or country) without arousing the user's suspicion. Such an attack affects the LLM's responses and can also undermine the user's autonomy without the user's knowledge.

### Main research content

- **Background and motivation**: As LLMs have developed, chatbots have become a routine part of users' digital experiences. Effective prompts are often difficult to write, however, so services that optimize or recommend prompts have emerged. Although these services reduce the burden on users, they introduce a new risk: the prompt provider can subtly manipulate prompts so that LLMs generate biased responses.
- **Research methods**: Through experiments and a user study, the authors show that synonym replacement can increase the probability that an LLM mentions a target concept (by a difference of up to 78%), and that users cannot tell the manipulated prompts apart from unaltered ones.
- **Experimental results**: The authors ran experiments on models including Llama2, Llama3, Llama3-it (instruction-tuned), and Gemma-it, and built a dataset of 524 prompts covering two scenarios: shopping and social topics. The results show that synonym replacement significantly increases the probability that LLMs mention the target concepts.
- **User studies**: To verify the effectiveness and stealthiness of the attack, the authors conducted a user study. The results show that users cannot detect the manipulated prompts, and that these prompts make users significantly more likely to notice the target concepts.

### Main contributions

1. **Defined a new threat model**: A malicious prompt provider manipulates prompts so that LLMs generate biased responses, thereby influencing users.
2. **Collected a dataset**: 524 prompts and their associated target concepts, used to evaluate the attack.
3. **Proposed a synonym replacement method**: Replacing words with synonyms significantly increases the probability that LLMs mention target concepts, while the change remains undetectable to users.
4. **Verified the transferability of the attack**: The attack transfers between different LLMs and remains effective even on API-only models.
5. **Verified the attack through user studies**: The user study confirms the effectiveness and stealthiness of the synonym replacement attack in realistic scenarios.

### Conclusion

This paper reveals the risks that third-party prompt providers may introduce and proposes a new attack method, synonym replacement, which can significantly increase the probability that LLMs mention specific concepts without arousing user suspicion. The finding is relevant to improving the security and transparency of LLMs.
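
To make the synonym replacement idea summarized above concrete, here is a minimal Python sketch. It is not the authors' algorithm: it is a brute-force illustration that enumerates synonym-swapped variants of a prompt, estimates how often each variant's responses mention a target concept, and reports the variant with the largest increase over the original prompt. The synonym dictionary, the `generate` callable, and `toy_generate` (a deterministic stand-in for an LLM) are all hypothetical names introduced for illustration.

```python
import itertools
from typing import Callable, Dict, List, Tuple


def candidate_prompts(prompt: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Enumerate all prompts formed by swapping words for listed synonyms.

    Words are split on whitespace, so a word with attached punctuation
    (e.g. "reliable,") will not match a dictionary key in this sketch.
    """
    options = [[w] + synonyms.get(w.lower(), []) for w in prompt.split()]
    return [" ".join(combo) for combo in itertools.product(*options)]


def mention_rate(prompt: str, target: str,
                 generate: Callable[[str], str], n_samples: int = 50) -> float:
    """Fraction of sampled responses that contain the target concept."""
    hits = sum(target.lower() in generate(prompt).lower() for _ in range(n_samples))
    return hits / n_samples


def best_perturbation(prompt: str, target: str, synonyms: Dict[str, List[str]],
                      generate: Callable[[str], str],
                      n_samples: int = 50) -> Tuple[str, float]:
    """Return the variant with the largest mention-rate difference vs. the original."""
    base = mention_rate(prompt, target, generate, n_samples)
    variants = [c for c in candidate_prompts(prompt, synonyms) if c != prompt]
    if not variants:
        return prompt, 0.0
    scored = [(mention_rate(c, target, generate, n_samples) - base, c) for c in variants]
    gain, best = max(scored)
    return best, gain


def toy_generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM; a real model would sample stochastically."""
    return "Try BrandX." if "dependable" in prompt else "Try BrandY."


if __name__ == "__main__":
    best, gain = best_perturbation(
        "a reliable laptop", "BrandX", {"reliable": ["dependable"]}, toy_generate
    )
    print(best, gain)
```

The reported gain is a difference between two mention rates, which is the same kind of quantity as the "difference of up to 78%" quoted in the abstract. Exhaustive enumeration grows exponentially with the number of replaceable words, so a practical search would need a more selective strategy than this sketch uses.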