PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Zhenqiang Gong, Xing Xie
2024-07-16
Abstract: The increasing reliance on Large Language Models (LLMs) across academia and industry necessitates a comprehensive understanding of their robustness to prompts. In response to this vital need, we introduce PromptRobust, a robustness benchmark designed to measure LLMs' resilience to adversarial prompts. This study uses a plethora of adversarial textual attacks targeting prompts across multiple levels: character, word, sentence, and semantic. The adversarial prompts, crafted to mimic plausible user errors like typos or synonyms, aim to evaluate how slight deviations can affect LLM outcomes while maintaining semantic integrity. These prompts are then employed in diverse tasks including sentiment analysis, natural language inference, reading comprehension, machine translation, and math problem-solving. Our study generates 4,788 adversarial prompts, meticulously evaluated over 8 tasks and 13 datasets. Our findings demonstrate that contemporary LLMs are not robust to adversarial prompts. Furthermore, we present a comprehensive analysis to understand the mystery behind prompt robustness and its transferability. We then offer insightful robustness analysis and pragmatic recommendations for prompt composition, beneficial to both researchers and everyday users.
Computation and Language, Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
This paper evaluates the robustness of large language models (LLMs) against adversarial prompts. Specifically, the authors introduce a benchmark named PromptRobust that systematically measures and analyzes how LLMs perform when given various adversarial prompts.

### Problem Background

As LLMs are widely applied across many fields, especially in safety-critical and decision-support settings, it is crucial to ensure that these models are robust to input perturbations. However, existing research mainly focuses on adversarial samples and largely overlooks the impact of adversarial prompts. An adversarial prompt is a slightly modified version of the original prompt (for example, with misspellings or synonym replacements) that may cause the LLM to generate an incorrect response.

### Core Problems of the Paper

1. **Robustness Evaluation**: Are current LLMs robust enough when facing adversarial prompts?
2. **Influencing Factors**: What factors make LLMs vulnerable to adversarial prompts?
3. **Improvement Strategies**: How can the robustness of LLMs against adversarial prompts be improved?

### Main Contributions

1. **PromptRobust Benchmark**: A comprehensive benchmark for evaluating the robustness of LLMs against different types of adversarial prompts.
2. **Comprehensive Evaluation and Analysis**: Extensive experiments on 8 tasks and 13 datasets show how LLMs perform under different attacks, accompanied by visual explanations and a transferability analysis.
3. **Practical Guidance**: Practical suggestions that help researchers and users design more robust prompts.

### Experimental Methods

- **Prompt Types**: Four prompt types are included: zero-shot, few-shot, role-oriented, and task-oriented.
- **Attack Types**: Four attack levels are covered: character-level, word-level, sentence-level, and semantic-level.
- **Evaluation Metric**: The Performance Drop Rate (PDR) is introduced as a unified evaluation metric to quantify how much model performance changes under adversarial prompts.

### Experimental Results

- **Overall Lack of Robustness**: Current LLMs generally lack robustness when facing adversarial prompts. Word-level attacks are the most effective, with an average performance drop of 33%.
- **Model Differences**: Robustness to adversarial prompts differs significantly across LLMs. GPT-4 and UL2 perform relatively well, while Vicuna is highly vulnerable.
- **Transferability Analysis**: Adversarial prompts transfer only to a limited extent between models, which indicates that a prompt crafted against one model is difficult to apply directly to another.

Overall, the authors emphasize the importance of evaluating and improving the robustness of LLMs against adversarial prompts, and they provide directions and suggestions for future research.
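
The summary names the Performance Drop Rate (PDR) without giving its formula. The sketch below assumes PDR is the relative drop in a task score when the clean prompt is swapped for its adversarial version, that is, 1 minus the ratio of the adversarial score to the clean score. The names `accuracy`, `performance_drop_rate`, and `toy_model`, as well as the example data and both prompts, are hypothetical stand-ins used only for illustration; they are not taken from the benchmark's code.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input text, gold label)


def accuracy(model: Callable[[str], str], prompt: str, data: Sequence[Example]) -> float:
    """Fraction of examples the model labels correctly under a given prompt."""
    if not data:
        raise ValueError("data must be non-empty")
    correct = sum(model(f"{prompt}\n{text}") == label for text, label in data)
    return correct / len(data)


def performance_drop_rate(clean_score: float, adversarial_score: float) -> float:
    """Assumed definition: relative drop of the adversarial score versus the clean score."""
    if clean_score <= 0:
        raise ValueError("clean_score must be positive")
    return 1.0 - adversarial_score / clean_score


def toy_model(text: str) -> str:
    """Hypothetical stand-in for an LLM that fails when the task keyword is misspelled."""
    if "sentiment" not in text:
        return "negative"
    return "positive" if "good" in text else "negative"


data = [
    ("good movie", "positive"),
    ("bad movie", "negative"),
    ("good food", "positive"),
    ("bad service", "negative"),
]

clean_prompt = "Classify the sentiment of the sentence."
adversarial_prompt = "Classify the sentimnet of the sentence."  # character-level typo

clean_acc = accuracy(toy_model, clean_prompt, data)
adv_acc = accuracy(toy_model, adversarial_prompt, data)
pdr = performance_drop_rate(clean_acc, adv_acc)
print(f"clean={clean_acc:.2f} adversarial={adv_acc:.2f} PDR={pdr:.2f}")
```

Under this assumed definition, a PDR of 0 means the attack had no effect, values near 1 mean the attack removed almost all of the clean-prompt performance, and a negative value means the perturbed prompt happened to score higher than the original.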