"You still have to study" -- On the Security of LLM generated code

Stefan Goetz, Andreas Schaad
2024-08-13
Abstract: We witness an increasing usage of AI assistants even for routine (classroom) programming tasks. However, the code generated on the basis of a so-called "prompt" by the programmer does not always meet accepted security standards. On the one hand, this may be due to a lack of best-practice examples in the training data. On the other hand, the actual quality of the programmer's prompt appears to influence whether the generated code contains weaknesses or not. In this paper we analyse 4 major LLMs with respect to the security of generated code. We do this on the basis of a case study for the Python and JavaScript languages, using the MITRE CWE catalogue as the guiding security definition. Our results show that, using different prompting techniques, some LLMs initially generate code of which 65% is deemed insecure by a trained security engineer. On the other hand, almost all analysed LLMs will eventually generate code that is close to 100% secure with increasing manual guidance from a skilled engineer.
Software Engineering, Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is the insufficient security of code generated by large language models (LLMs). Specifically, the researchers are concerned with:

1. **Security issues of code generated by AI assistants**: Although AI assistants (such as GitHub Copilot, ChatGPT, etc.) can help programmers complete programming tasks, the generated code does not always meet security standards. This may be due to a lack of best-practice examples in the training data or to low-quality "prompts" provided by users.
2. **The impact of different prompting techniques on code security**: The researchers analysed the security of Python and JavaScript code generated by four major LLMs, namely ChatGPT, Copilot, CodeWhisperer, and CodeLlama. They used the MITRE CWE catalogue as the guiding security definition to evaluate how different prompting techniques affect the security of the generated code.
3. **The role of manual guidance**: With different prompting techniques, some LLMs initially generate code of which 65% is deemed insecure by a trained security engineer. With increasing manual guidance from a skilled engineer, almost all of the analysed LLMs eventually generate code that is close to 100% secure.

### Formulas and Symbols

Some key concepts involved in discussing code security can be written as follows:

- **CWE (Common Weakness Enumeration)**: A catalogue used to describe software security weaknesses, viewed here as a set of weaknesses $w_i$.
  \[ CWE=\{w_1, w_2,\ldots, w_n\} \]
- **Prevention of SQL injection attacks**: Use prepared statements, which keep the SQL query text separate from the user-supplied inputs.
  \[ \text{Prepared Statement}=\text{SQL Query}+\text{Parameterized Inputs} \]

### Summary

The core question of this paper is how to improve the security of LLM-generated code through improved prompting techniques. The results show that carefully designed prompts and gradually increasing human intervention can significantly improve the security of code generated by LLMs.
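
The prepared-statement idea mentioned under Formulas and Symbols can be illustrated with a small sketch. It is not taken from the paper: the `users` table, the function names `build_demo_db`, `find_user_unsafe` and `find_user_safe`, and the injection payload are all hypothetical, and Python's standard `sqlite3` module merely stands in for whatever database layer a generated program might use.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    conn.commit()
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Weak pattern (SQL injection, CWE-89): user input is concatenated
    # into the query text, so it can change the structure of the query.
    query = "SELECT name, secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized query: the SQL text is fixed and the input is passed
    # separately, so it is always treated as a value, never as SQL.
    query = "SELECT name, secret FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    payload = "x' OR '1'='1"  # hypothetical attacker-controlled input
    print("unsafe rows:", len(find_user_unsafe(conn, payload)))
    print("safe rows:  ", len(find_user_safe(conn, payload)))
```

With the payload shown, the concatenated query matches every row in the table, while the parameterized query matches none, because no user is literally named `x' OR '1'='1`.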