LLMs in Web Development: Evaluating LLM-Generated PHP Code Unveiling Vulnerabilities and Limitations

Rebeka Tóth, Tamas Bisztray, László Erdodi
2024-05-21
Abstract: This study evaluates the security of web application code generated by Large Language Models, analyzing 2,500 GPT-4 generated PHP websites. These were deployed in Docker containers and tested for vulnerabilities using a hybrid approach of Burp Suite active scanning, static analysis, and manual review. Our investigation focuses on identifying Insecure File Upload, SQL Injection, Stored XSS, and Reflected XSS in GPT-4 generated PHP code. This analysis highlights potential security risks and the implications of deploying such code in real-world scenarios. Overall, our analysis found 2,440 vulnerable parameters. According to Burp's scan, 11.56% of the sites can be compromised outright. Adding static scan results, 26% had at least one vulnerability that can be exploited through web interaction. Certain coding scenarios, like file upload functionality, are insecure 78% of the time, underscoring significant risks to software safety and security. To support further research, we have made the source code and a detailed vulnerability record for each sample publicly available. This study emphasizes the crucial need for thorough testing and evaluation if generative AI technologies are used in software development.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
This paper evaluates the security of PHP code generated by large language models (LLMs), focusing on the vulnerabilities and limitations of GPT-4-generated PHP code in real deployments. The researchers approach the problem in three stages:

1. **Generate and test a large number of PHP websites**:
   - The researchers used GPT-4 to generate 2,500 PHP websites and deployed them in Docker containers.
   - The websites cover a variety of common development tasks, product types, and styles to keep the data set diverse.
2. **Multi-stage vulnerability detection**:
   - **Dynamic scanning**: Burp Suite active scanning identifies vulnerabilities such as SQL injection, stored cross-site scripting (Stored XSS), and reflected cross-site scripting (Reflected XSS).
   - **Static analysis**: Python scripts check whether file upload functionality lacks extension validation and whether SQL queries use prepared statements.
   - **Manual review**: 50 randomly selected samples per vulnerability category are reviewed by hand to verify the accuracy of the automated tools.
3. **Result analysis and public data**:
   - At least 11.56% of the generated websites have exploitable vulnerabilities; 78% of the file upload functions are insecure, and 54.28% of the SQL queries do not use prepared statements.
   - The researchers published all generated source code and detailed vulnerability records on GitHub to support further research.

### Main research questions

- **RQ1**: How many vulnerabilities exist in the PHP code generated by GPT-4 in a zero-shot setting?
  - At least 11.56% of the generated websites have exploitable vulnerabilities, with file upload functions and SQL queries standing out as the most affected areas.
- **RQ2**: Does the generated PHP code reach the complexity and realism needed for actual deployment?
  - Manual review found that 66% of the samples are too simple: they lack key functions and best practices and are not suitable for actual deployment.

### Summary

This study emphasizes the importance of thorough testing and evaluation, especially for security, when generative AI is used in software development. LLMs can generate code quickly, but that code may carry serious security risks and should be treated with caution.
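
To make the static-analysis stage described above more concrete, here is a minimal Python sketch of the two checks it mentions: file uploads without an extension check, and SQL queries issued without prepared statements. It illustrates the idea under stated assumptions and is not the authors' script. The regular expressions, the function name `scan_php_source`, the finding labels, and the two PHP samples are all hypothetical.

```python
import re

# Hypothetical heuristics for illustration only; the study's actual
# scripts and patterns are not reproduced here.

# A file is treated as handling uploads if it calls move_uploaded_file().
UPLOAD_CALL = re.compile(r"\bmove_uploaded_file\s*\(", re.IGNORECASE)

# Assumed sign of extension validation: pathinfo(..., PATHINFO_EXTENSION).
EXTENSION_CHECK = re.compile(
    r"\bpathinfo\s*\([^)]*PATHINFO_EXTENSION", re.IGNORECASE
)

# Direct query execution (procedural or object-oriented mysqli / PDO style).
QUERY_CALL = re.compile(r"\bmysqli_query\s*\(|->\s*query\s*\(", re.IGNORECASE)

# Any use of a prepared statement.
PREPARE_CALL = re.compile(r"\bmysqli_prepare\s*\(|->\s*prepare\s*\(", re.IGNORECASE)


def scan_php_source(source: str) -> list:
    """Return coarse findings for one PHP file given as a string."""
    findings = []
    if UPLOAD_CALL.search(source) and not EXTENSION_CHECK.search(source):
        findings.append("upload_without_extension_check")
    if QUERY_CALL.search(source) and not PREPARE_CALL.search(source):
        findings.append("sql_without_prepared_statement")
    return findings


# Two made-up PHP samples: one with both issues, one that avoids them.
VULNERABLE = r'''<?php
$conn = new mysqli("db", "user", "pass", "shop");
$name = $_GET["name"];
$result = $conn->query("SELECT * FROM products WHERE name = '$name'");
move_uploaded_file($_FILES["file"]["tmp_name"], "uploads/" . $_FILES["file"]["name"]);
'''

SAFER = r'''<?php
$conn = new mysqli("db", "user", "pass", "shop");
$stmt = $conn->prepare("SELECT * FROM products WHERE name = ?");
$stmt->bind_param("s", $_GET["name"]);
$stmt->execute();
$ext = strtolower(pathinfo($_FILES["file"]["name"], PATHINFO_EXTENSION));
if (in_array($ext, ["png", "jpg"], true)) {
    move_uploaded_file($_FILES["file"]["tmp_name"], "uploads/" . uniqid() . "." . $ext);
}
'''

if __name__ == "__main__":
    print(scan_php_source(VULNERABLE))
    print(scan_php_source(SAFER))
```

Pattern matching of this kind only flags candidates. It can miss validation written in other ways, and it can flag queries that never touch user input. This is consistent with the study pairing static checks with dynamic scanning and manual review.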