Abstract: This study evaluates the security of web application code generated by Large Language Models, analyzing 2,500 GPT-4 generated PHP websites. These were deployed in Docker containers and tested for vulnerabilities using a hybrid approach of Burp Suite active scanning, static analysis, and manual review. Our investigation focuses on identifying Insecure File Upload, SQL Injection, Stored XSS, and Reflected XSS in GPT-4 generated PHP code. This analysis highlights potential security risks and the implications of deploying such code in real-world scenarios. Overall, our analysis found 2,440 vulnerable parameters. According to Burp's scan, 11.56% of the sites can be compromised outright. Adding static scan results, 26% had at least one vulnerability that can be exploited through web interaction. Certain coding scenarios, like file upload functionality, are insecure 78% of the time, underscoring significant risks to software safety and security. To support further research, we have made the source code and a detailed vulnerability record for each sample publicly available. This study emphasizes the crucial need for thorough testing and evaluation if generative AI technologies are used in software development.

What problem does this paper attempt to address?

This paper evaluates the security of PHP code generated by large language models (LLMs), focusing on the vulnerabilities and limitations of GPT-4-generated PHP code in actual deployment. The researchers address the problem in three steps:

1. **Generate and test a large number of PHP websites**:
   - The researchers used GPT-4 to generate 2,500 PHP websites and deployed them in Docker containers.
   - These websites cover a variety of common development tasks, product types, and styles to ensure the diversity of the dataset.
2. **Multi-stage vulnerability detection**:
   - **Dynamic scanning**: Burp Suite active scanning identifies vulnerabilities such as SQL injection, stored cross-site scripting (Stored XSS), and reflected cross-site scripting (Reflected XSS).
   - **Static analysis**: Python scripts check whether file upload code lacks the necessary extension validation and whether SQL queries use prepared statements.
   - **Manual review**: 50 randomly selected samples per vulnerability category are reviewed by hand to verify the accuracy of the automated tools.
3. **Result analysis and public data**:
   - At least 11.16% of the generated websites have exploitable vulnerabilities; 78% of the file upload functions are insecure, and 54.28% of the SQL queries do not use prepared statements.
   - The researchers published all generated source code and detailed vulnerability records on GitHub to support further research.

### Main research questions

- **RQ1**: How many vulnerabilities exist in the PHP code generated by GPT-4 in a zero-shot setting?
  - At least 11.16% of the generated websites have exploitable vulnerabilities, with security problems in file upload functions and SQL queries being especially prominent.
- **RQ2**: Does the generated PHP code reach the complexity and realism needed for actual deployment?
  - Manual review found that 66% of the samples are too simple: they lack key functions and best practices and are not suitable for actual deployment.

### Summary

This study emphasizes the importance of thorough testing and evaluation when generative AI technology is used in software development, especially with respect to security. Although LLMs can generate code quickly, that code may carry serious security risks and should be treated with caution.
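
The static-analysis stage described above (Python scripts that check file upload code for missing extension validation and SQL queries for missing prepared statements) could be approximated along the following lines. This is a minimal sketch, not the authors' scripts: the regular expressions, the function names `scan_php_source` and `scan_directory`, and the two sample PHP snippets are all assumptions made for illustration, and a file-level pattern match like this is a crude heuristic that can produce both false positives and false negatives.

```python
import re
from pathlib import Path
from typing import Dict

# Assumed heuristics (not taken from the paper): simple file-level pattern checks.
UPLOAD_CALL = re.compile(r"\bmove_uploaded_file\s*\(", re.IGNORECASE)
EXTENSION_CHECK = re.compile(r"\bpathinfo\s*\(|PATHINFO_EXTENSION", re.IGNORECASE)
RAW_QUERY_CALL = re.compile(r"\bmysqli_query\s*\(|->\s*query\s*\(", re.IGNORECASE)
PREPARED_CALL = re.compile(r"->\s*prepare\s*\(|\bmysqli_prepare\s*\(", re.IGNORECASE)


def scan_php_source(source: str) -> Dict[str, bool]:
    """Flag a PHP source string for the two statically checked weaknesses."""
    has_upload = bool(UPLOAD_CALL.search(source))
    has_extension_check = bool(EXTENSION_CHECK.search(source))
    has_raw_query = bool(RAW_QUERY_CALL.search(source))
    has_prepared = bool(PREPARED_CALL.search(source))
    return {
        # An upload is stored, but the file never inspects the extension.
        "upload_without_extension_check": has_upload and not has_extension_check,
        # A query is executed directly, and the file never prepares a statement.
        "sql_without_prepared_statement": has_raw_query and not has_prepared,
    }


def scan_directory(root: str) -> Dict[str, Dict[str, bool]]:
    """Apply scan_php_source to every .php file below root."""
    results = {}
    for path in Path(root).rglob("*.php"):
        results[str(path)] = scan_php_source(path.read_text(errors="ignore"))
    return results


# Hypothetical samples, written for this sketch only.
INSECURE_SAMPLE = r"""<?php
$name = $_POST['name'];
move_uploaded_file($_FILES['f']['tmp_name'], 'uploads/' . $_FILES['f']['name']);
$result = mysqli_query($conn, "SELECT * FROM users WHERE name = '$name'");
"""

SAFER_SAMPLE = r"""<?php
$ext = strtolower(pathinfo($_FILES['f']['name'], PATHINFO_EXTENSION));
if (in_array($ext, ['png', 'jpg'], true)) {
    move_uploaded_file($_FILES['f']['tmp_name'], 'uploads/' . uniqid() . '.' . $ext);
}
$stmt = $pdo->prepare('SELECT * FROM users WHERE name = ?');
$stmt->execute([$_POST['name']]);
"""

if __name__ == "__main__":
    print(scan_php_source(INSECURE_SAMPLE))
    print(scan_php_source(SAFER_SAMPLE))
```

Calling `scan_directory` on a folder of generated sites would yield one flag dictionary per PHP file. Because the checks only look for the presence of certain calls anywhere in a file, flagged files would still need the kind of manual review the study describes.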

LLMs in Web Development: Evaluating LLM-Generated PHP Code Unveiling Vulnerabilities and Limitations
