Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval

Jiexin Wang, Xitong Luo, Liuwen Cao, Hongkui He, Hailin Huang, Jiayuan Xie, Adam Jatowt, Yi Cai
2024-07-04
Abstract: Large language models (LLMs) have brought significant advancements to code generation and code repair, benefiting both novice and experienced developers. However, their training on unsanitized data from open-source repositories, like GitHub, raises the risk of inadvertently propagating security vulnerabilities. Despite numerous studies investigating the safety of code LLMs, there remains a gap in comprehensively addressing their security features. In this work, we present a comprehensive study that aims to precisely evaluate and enhance the security aspects of code LLMs. To support our research, we introduce CodeSecEval, a meticulously curated dataset designed to address 44 critical vulnerability types with 180 distinct samples. CodeSecEval serves as the foundation for the automatic evaluation of code models in two crucial tasks: code generation and code repair, with a strong emphasis on security. Our experimental results reveal that current models frequently overlook security issues during both code generation and repair processes, resulting in the creation of vulnerable code. In response, we propose different strategies that leverage vulnerability-aware information and insecure code explanations to mitigate these security vulnerabilities. Furthermore, our findings highlight that certain vulnerability types particularly challenge model performance, influencing their effectiveness in real-world applications. Based on these findings, we believe our study will have a positive impact on the software engineering community, inspiring the development of improved methods for training and utilizing LLMs, thereby leading to safer and more trustworthy model deployment.
Software Engineering, Computation and Language
What problem does this paper attempt to address?
This paper addresses the security vulnerabilities that large language models (LLMs) introduce during code generation and code repair. Although these models improve development efficiency, they are often trained on unfiltered open-source code repositories and may therefore propagate the vulnerabilities contained in that code. Prior research has examined the security of generated code, but a comprehensive evaluation of these models' security behavior, and of ways to improve it, is still missing. The paper evaluates existing models on both tasks using a carefully designed dataset, CodeSecEval, and proposes strategies to mitigate the vulnerabilities it finds, with the goal of raising awareness in the software engineering community and promoting safer and more reliable LLM applications.

Specifically, the researchers address this problem through the following approaches:

1. **Constructing the CodeSecEval dataset**: The dataset contains 180 samples covering 44 critical vulnerability types. Each sample includes executable secure and insecure code examples as well as test cases, enabling automatic and precise evaluation of code security.
2. **Evaluating the performance of existing models**: An extensive evaluation of seven state-of-the-art code generation LLMs shows that these models generally overlook security issues in both code generation and code repair.
3. **Proposing mitigation strategies**: Based on the evaluation results, the researchers propose strategies that add vulnerability-aware information and explanations of insecure code to improve security in code generation and repair.
4. **Validating the effectiveness of the strategies**: Experiments validate these strategies and show that they can significantly reduce security vulnerabilities.

In summary, this research provides new tools and methods for evaluating and enhancing the security of LLMs in code generation and repair tasks, and it offers recommendations for future LLM training and deployment that contribute to safer programming practices.
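To make the idea of test-based security evaluation more concrete, the following is a minimal sketch of how a sample that pairs secure and insecure code with test cases could be checked automatically. It is not the CodeSecEval harness: the `Sample` fields, the helper names `passes_all_tests` and `build_prompt`, and the eval-injection example (CWE-95) are all assumptions chosen for illustration, and a real harness would run candidate code in a sandbox rather than with a bare `exec`.

```python
import textwrap
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Sample:
    """Hypothetical layout of one benchmark entry (field names are assumptions)."""
    cwe_id: str
    problem: str
    insecure_code: str
    secure_code: str
    entry_point: str
    tests: List[Callable[[Callable], bool]]


def passes_all_tests(code: str, sample: Sample) -> bool:
    """Run candidate code and apply every test to its entry-point function."""
    namespace: Dict[str, object] = {}
    try:
        exec(code, namespace)  # illustration only: sandbox this in practice
        func = namespace[sample.entry_point]
        return all(test(func) for test in sample.tests)
    except Exception:
        return False


def build_prompt(sample: Sample, vulnerability_hint: Optional[str] = None) -> str:
    """Return the plain problem, or the problem plus a vulnerability-aware hint."""
    if vulnerability_hint is None:
        return sample.problem
    return f"{sample.problem}\nSecurity note: {vulnerability_hint}"


def parses_literals(func: Callable) -> bool:
    # Functional test: ordinary input must still be handled correctly.
    return func("[1, 2, 3]") == [1, 2, 3]


def rejects_code_execution(func: Callable) -> bool:
    # Security test: an expression that calls a function must be refused.
    try:
        func("__import__('os').getcwd()")
    except (ValueError, SyntaxError):
        return True
    return False


sample = Sample(
    cwe_id="CWE-95",
    problem="Write parse(text) that converts a Python literal string to a value.",
    insecure_code=textwrap.dedent("""
        def parse(text):
            return eval(text)
    """),
    secure_code=textwrap.dedent("""
        import ast

        def parse(text):
            return ast.literal_eval(text)
    """),
    entry_point="parse",
    tests=[parses_literals, rejects_code_execution],
)

secure_ok = passes_all_tests(sample.secure_code, sample)
insecure_ok = passes_all_tests(sample.insecure_code, sample)
hinted_prompt = build_prompt(sample, "do not evaluate untrusted input (CWE-95).")
```

In this sketch, model output for either task would be passed to `passes_all_tests` in place of the reference code. A candidate counts as secure only if it passes both the functional and the security test, which is what allows the check to run without manual review. `build_prompt` shows one plausible way to attach vulnerability-aware information to a problem statement; the exact prompt wording used in the paper is not reproduced here.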