NLPerturbator: Studying the Robustness of Code LLMs to Natural Language Variations

Junkai Chen, Zhenhao Li, Xing Hu, Xin Xia
2024-06-28
Abstract: Large language models (LLMs) achieve promising results in code generation based on a given natural language description. They have been integrated into open-source projects and commercial products to facilitate daily coding activities. The natural language description in the prompt is crucial for LLMs to comprehend users' requirements. Prior studies have found that LLMs are sensitive to changes in the prompts, including slight changes that look inconspicuous. However, natural language descriptions often vary in real-world scenarios (e.g., different formats, grammar, and wording). Prior studies on the robustness of LLMs are often based on random perturbations, and such perturbations may not actually happen. In this paper, we conduct a comprehensive study to investigate how robust code LLMs are to variations of natural language descriptions in real-world scenarios. We summarize 18 categories of natural language perturbations and 3 combinations of co-occurring categories based on our literature review and an online survey with practitioners. We propose an automated framework, NLPerturbator, which can perform perturbations of each category given a set of prompts. Through a series of experiments on code generation using six code LLMs, we find that the perturbed prompts can decrease the performance of code generation by a considerable margin (e.g., up to 21.2%, and 4.8% to 6.1% on average). Our study highlights the importance of enhancing the robustness of LLMs to real-world variations in the prompts, as well as the need to construct prompts carefully.
Software Engineering, Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper investigates how robust code LLMs are when the natural language description in a code generation prompt varies. Specifically, it focuses on the variations that descriptions undergo in real-world scenarios (such as different formats, grammar, and wording) and on how those variations affect the correctness of the generated code.

### Background and Motivation

1. **Application of LLMs in Code Generation**:
   - LLMs achieve promising results in generating code from natural language descriptions and have been integrated into open-source projects and commercial products to assist with daily programming activities.
   - The natural language description is crucial for LLMs to understand user requirements.
2. **Changes in Natural Language Descriptions**:
   - In real-world scenarios, natural language descriptions vary in format, grammar, and wording.
   - Previous research has found that LLMs are sensitive to changes in the prompt, including slight changes that look inconspicuous.
3. **Limitations of Existing Research**:
   - Previous robustness studies were mainly based on random perturbations, which may not occur in actual use.
   - It is therefore unclear how robust code LLMs are to the variations in natural language descriptions that do occur in practice.

### Research Objectives

1. **Identify Categories of Natural Language Perturbations in Real-World Scenarios**:
   - Through a literature review and an online survey with practitioners, identify 18 categories of natural language perturbations and summarize 3 combinations of co-occurring categories.
2. **Develop an Automated Framework**:
   - Propose an automated framework, NLPerturbator, that perturbs a set of prompts according to each identified category.
3. **Evaluate the Robustness of Code Generation**:
   - Through a series of experiments, measure how much the performance of six code LLMs degrades under perturbed prompts.

### Main Contributions

1. **Propose the Automated Framework NLPerturbator**:
   - The framework perturbs prompts according to categories of natural language perturbations observed in real-world scenarios.
2. **Provide Manually Verified Datasets**:
   - Provide two manually verified datasets, HumanEval-R and MBPP-R, for studying the robustness of LLMs under natural language perturbations.
3. **Comprehensive Evaluation of LLM Robustness**:
   - Evaluate the robustness of six code LLMs and discuss directions for improving prompt engineering.

### Conclusion

This study emphasizes the importance of enhancing the robustness of LLMs to real-world variations in natural language and the need to construct prompts carefully. Its findings offer guidance for applying LLMs to code generation in practice.
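The perturb-then-evaluate idea summarized above can be illustrated with a minimal Python sketch. It is not NLPerturbator's implementation: the function names (`swap_adjacent_chars`, `perturb_prompt`, `relative_drop`), the choice of a typo-style perturbation, the `rate` parameter, and the relative-drop formula are all assumptions made for illustration, and the summary does not state which metric definition the paper uses.

```python
import random


def swap_adjacent_chars(word, rng):
    """Swap two adjacent inner characters to imitate a typing slip.

    Hypothetical helper: the first and last characters are left in place.
    """
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def perturb_prompt(prompt, rate=0.2, seed=0):
    """Apply the typo perturbation to a fraction of alphabetic words.

    `rate` and `seed` are illustrative parameters, not taken from the paper.
    """
    rng = random.Random(seed)
    perturbed = []
    for word in prompt.split(" "):
        if word.isalpha() and rng.random() < rate:
            word = swap_adjacent_chars(word, rng)
        perturbed.append(word)
    return " ".join(perturbed)


def relative_drop(original_score, perturbed_score):
    """Relative performance decrease in percent (assumed metric definition)."""
    if original_score <= 0:
        raise ValueError("original_score must be positive")
    return 100.0 * (original_score - perturbed_score) / original_score


if __name__ == "__main__":
    prompt = "Return the sorted list of unique elements"
    print(perturb_prompt(prompt, rate=1.0))
    # Scores below are placeholders, not results from the paper.
    print(relative_drop(50.0, 40.0))
```

In a real study, the original and perturbed prompts would each be sent to a code LLM and the generated code scored against test cases; the sketch only covers the perturbation step and the comparison of two already-computed scores.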