Abstract: Large language models (LLMs) achieve promising results in code generation based on a given natural language description. They have been integrated into open-source projects and commercial products to facilitate daily coding activities. The natural language description in the prompt is crucial for LLMs to comprehend users' requirements. Prior studies uncover that LLMs are sensitive to changes in the prompts, including slight changes that look inconspicuous. However, the natural language descriptions often vary in real-world scenarios (e.g., different formats, grammar, and wording). Prior studies on the robustness of LLMs are often based on random perturbations, and such perturbations may not actually happen. In this paper, we conduct a comprehensive study to investigate how robust code LLMs are to variations of natural language description in real-world scenarios. We summarize 18 categories of perturbations of natural language and 3 combinations of co-occurring categories based on our literature review and an online survey with practitioners. We propose an automated framework, NLPerturbator, which can perform perturbations of each category given a set of prompts. Through a series of experiments on code generation using six code LLMs, we find that the perturbed prompts can decrease the performance of code generation by a considerable margin (e.g., up to 21.2%, and 4.8% to 6.1% on average). Our study highlights the importance of enhancing the robustness of LLMs to real-world variations in the prompts, as well as the essentiality of attentively constructing the prompts.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper investigates how robust large language models (LLMs) are to changes in natural language descriptions, specifically in code generation tasks. It focuses on the variations that descriptions undergo in real-world scenarios (such as different formats, grammar, and wording), which may affect the accuracy of the code that LLMs generate.

### Background and Motivation

1. **Application of LLMs in Code Generation**:
   - Large language models have achieved promising results in generating code from natural language descriptions and have been integrated into open-source projects and commercial products to assist with daily programming activities.
   - Natural language descriptions are crucial for LLMs to understand user requirements.
2. **Changes in Natural Language Descriptions**:
   - In real-world scenarios, natural language descriptions vary in wording, grammar, and format, and may even contain spelling errors.
   - Previous research has found that LLMs are sensitive to such changes, including slight changes that look inconspicuous.
3. **Limitations of Existing Research**:
   - Previous robustness studies were mainly based on random perturbations, which may not occur in actual use.
   - It is therefore unclear how robust LLMs are to changes in natural language descriptions in real-world scenarios.

### Research Objectives

1. **Identify Categories of Natural Language Perturbations in Real-World Scenarios**:
   - Through a literature review and an online survey with practitioners, identify 18 categories of natural language perturbations that may occur in practice and summarize 3 combinations of co-occurring categories.
2. **Develop an Automated Framework**:
   - Propose an automated framework, NLPerturbator, that can perturb a set of prompts according to each identified category.
3. **Evaluate the Robustness of Code Generation**:
   - Through a series of experiments, measure the performance degradation of six code LLMs under perturbed prompts.

### Main Contributions

1. **Propose the Automated Framework NLPerturbator**:
   - The framework perturbs prompts based on categories of natural language perturbations observed in real-world scenarios.
2. **Provide Manually Verified Datasets**:
   - Provide two manually verified datasets, HumanEval-R and MBPP-R, for studying the robustness of LLMs under natural language perturbations.
3. **Comprehensive Evaluation of LLM Robustness**:
   - Evaluate the robustness of six code LLMs and discuss directions for improving prompt engineering.

### Conclusion

This study emphasizes the importance of enhancing the robustness of LLMs to real-world variations in natural language prompts, as well as the need to construct prompts carefully. Its findings can guide the practical use of LLMs in code generation tasks.
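To make the idea of category-based prompt perturbation more concrete, here is a minimal Python sketch. It is not NLPerturbator's implementation: the two categories (`swap_adjacent_chars`, `drop_articles`), the `perturb` helper, and the `relative_drop` metric are all hypothetical names chosen for illustration, and they are not claimed to match any of the paper's 18 categories. The summary above does not say whether the reported decreases are relative or absolute, so computing a relative drop is an assumption.

```python
import random
from typing import Callable, Dict

# A perturbation takes a prompt and a seeded RNG and returns a perturbed prompt.
Perturbation = Callable[[str, random.Random], str]

ARTICLES = {"a", "an", "the"}


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Hypothetical category: simulate a typo by swapping two inner letters of one word."""
    words = text.split(" ")
    # Only consider purely alphabetic words long enough to have two inner letters.
    candidates = [i for i, w in enumerate(words) if len(w) >= 4 and w.isalpha()]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(1, len(w) - 2)  # first and last letters stay in place
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def drop_articles(text: str, rng: random.Random) -> str:
    """Hypothetical category: remove English articles, as in terse descriptions."""
    return " ".join(w for w in text.split(" ") if w.lower() not in ARTICLES)


# Registry mapping a category name to its perturbation function.
PERTURBATIONS: Dict[str, Perturbation] = {
    "swap_adjacent_chars": swap_adjacent_chars,
    "drop_articles": drop_articles,
}


def perturb(prompt: str, category: str, seed: int = 0) -> str:
    """Apply one perturbation category to a prompt, reproducibly for a given seed."""
    rng = random.Random(seed)
    return PERTURBATIONS[category](prompt, rng)


def relative_drop(original_score: float, perturbed_score: float) -> float:
    """Relative performance decrease in percent (an assumed metric definition)."""
    if original_score <= 0:
        raise ValueError("original_score must be positive")
    return (original_score - perturbed_score) / original_score * 100.0


if __name__ == "__main__":
    prompt = "Return the sorted list"
    for name in PERTURBATIONS:
        print(name, "->", perturb(prompt, name))
    print(relative_drop(0.5, 0.4))
```

In an evaluation loop, each prompt of a benchmark would be passed through `perturb` once per category, the model's output would be scored on the original and the perturbed prompt, and the two scores would be compared with a metric such as `relative_drop`. Seeding the random generator per call keeps the perturbed prompts reproducible across runs.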

NLPerturbator: Studying the Robustness of Code LLMs to Natural Language Variations
