Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D. Le, David Lo
Abstract: Since its introduction in November 2022, ChatGPT has rapidly gained popularity due to its remarkable ability in language understanding and human-like responses. ChatGPT, based on the GPT-3.5 architecture, has shown great promise for revolutionizing various research fields, including code generation. However, the reliability and quality of code generated by ChatGPT remain unexplored, raising concerns about potential risks associated with the widespread use of ChatGPT-driven code generation. In this paper, we systematically study the quality of 4,066 ChatGPT-generated programs implemented in two popular programming languages, i.e., Java and Python, for 2,033 programming tasks. The goal of this work is threefold. First, we analyze the correctness of ChatGPT on code generation tasks and uncover the factors that influence its effectiveness, including task difficulty, programming language, the time at which tasks were introduced, and program size. Second, we identify and characterize potential issues with the quality of ChatGPT-generated code. Last, we provide insights into how these issues can be mitigated. Experiments highlight that out of 4,066 programs generated by ChatGPT, 2,756 programs are deemed correct, 1,082 programs provide wrong outputs, and 177 programs contain compilation or runtime errors. Additionally, we analyze other characteristics of the generated code, such as code style and maintainability, through static analysis tools, and find that 1,930 ChatGPT-generated code snippets suffer from maintainability issues. Subsequently, we investigate ChatGPT’s self-repairing ability and its interaction with static analysis tools to fix the errors uncovered in the previous step. Experiments suggest that ChatGPT can partially address these challenges, improving code quality by more than 20%, but there are still limitations and opportunities for improvement.
Overall, our study provides valuable insights into the current limitations of ChatGPT and offers a roadmap for future research and development efforts to enhance the code generation capabilities of AI models like ChatGPT.
### Problems the paper attempts to solve
This paper aims to systematically study the quality issues of the code generated by ChatGPT. Specifically, the paper has three objectives:
1. **Analyze the correctness of ChatGPT in code-generation tasks**:
- Examine how well ChatGPT-generated code performs across programming tasks, and how task difficulty, programming language, the time at which a task was introduced, and program size affect its effectiveness.
2. **Identify and characterize the quality problems of ChatGPT-generated code**:
- Through experiments and static analysis tools, identify common quality problems in the generated code, such as compilation errors, runtime errors, wrong outputs, and code style and maintainability issues.
3. **Explore how to mitigate these problems**:
- Using different prompting strategies, feed static analysis reports and runtime error messages back to ChatGPT so that it can fix the quality problems in its own code, and evaluate this self-repair ability.
### Experimental design and results
To achieve these objectives, the authors conducted the following experiments:
- **Data collection**:
- Collected 2,033 programming tasks from LeetCode, covering all three difficulty levels (easy, medium, hard).
- For each task, ChatGPT generated code in two languages, Java and Python.
- **Performance evaluation**:
- Used LeetCode's test suite to check whether each ChatGPT-generated program passes all test cases.
- Of the 4,066 generated programs, 2,756 are deemed correct, 1,082 produce wrong outputs, and 177 contain compilation or runtime errors.
- **Code quality analysis**:
- Used static analysis tools (such as Pylint, Flake8, PMD, and Checkstyle) to analyze further characteristics of the generated code, such as code style and maintainability.
- 1,930 ChatGPT-generated code snippets were found to have maintainability problems.
- **Self-repair ability evaluation**:
- Used different repair prompts to evaluate how well ChatGPT fixes code quality problems.
- Experiments show that ChatGPT can partially resolve these problems, improving code quality by more than 20%, but there is still room for improvement.
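
The outcome categories used in the performance evaluation (correct, wrong output, compilation or runtime error) can be illustrated with a minimal local sketch. This is not the authors' pipeline, which relies on LeetCode's own test suite; the function names `classify` and `tally` and the `(args, expected)` test-case format are assumptions made for illustration only.

```python
from collections import Counter


def classify(source, func_name, test_cases):
    """Label one generated Python program (hypothetical helper).

    test_cases is a list of (args, expected) pairs. A real harness would
    run untrusted code in a sandbox with a time limit; this sketch does not.
    """
    try:
        code = compile(source, "<generated>", "exec")
    except SyntaxError:
        return "compile_error"
    namespace = {}
    try:
        exec(code, namespace)  # defines the candidate function
        func = namespace[func_name]
        outputs = [func(*args) for args, _ in test_cases]
    except Exception:
        return "runtime_error"
    expected = [exp for _, exp in test_cases]
    return "correct" if outputs == expected else "wrong_output"


def tally(programs, func_name, test_cases):
    """Count outcome labels over a collection of generated programs."""
    return Counter(classify(src, func_name, test_cases) for src in programs)
```

Calling `tally` on a list of generated sources yields per-category counts, the same kind of breakdown reported above, from which a pass rate per difficulty level or language could be derived.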
### Main findings
1. **Performance**:
- ChatGPT performs best on easy tasks, and its performance declines steadily as task difficulty increases.
- For example, ChatGPT has a pass rate of 89% on easy Python tasks, while on hard tasks the pass rate drops to 40%.
2. **Common quality problems**:
- The generated code exhibits compilation and runtime errors, wrong outputs, and code style and maintainability problems.
- Even if the test cases are passed, the generated code may still have style and maintainability problems. For example, 53% of Java code and 37% of Python code have these problems.
3. **Self-repair ability**:
- ChatGPT can partially fix code quality problems when it receives feedback from static analysis tools and runtime errors.
- The effectiveness of the repair varies with the type of feedback, the programming language, and the kind of quality problem.
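
The feedback-driven repair described above can be sketched as a small loop. The prompt wording, the function names `build_repair_prompt` and `repair_loop`, and the `check` and `ask_model` callables are all hypothetical stand-ins chosen for illustration; the paper's actual prompts and tooling are not reproduced here.

```python
def build_repair_prompt(code, feedback_lines, language="Python"):
    """Combine the code and tool feedback into one repair request.

    The template text is an assumption, not the wording used in the paper.
    """
    issues = "\n".join(f"- {line}" for line in feedback_lines)
    return (
        f"The following {language} code has quality issues.\n\n"
        f"{code.strip()}\n\n"
        f"Reported issues:\n{issues}\n\n"
        "Please fix the issues and return only the corrected code."
    )


def repair_loop(code, check, ask_model, max_rounds=3):
    """Ask the model for fixes until the checker reports nothing.

    check(code) returns a list of issue strings (for example, static
    analysis messages or a runtime error trace); ask_model(prompt)
    returns revised code. Both are caller-supplied placeholders.
    """
    for _ in range(max_rounds):
        feedback = check(code)
        if not feedback:
            break
        code = ask_model(build_repair_prompt(code, feedback))
    return code
```

Bounding the number of rounds reflects the finding that repair is only partial: some issues may remain after several attempts, so the loop returns the latest version rather than assuming success.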
### Summary
Through a systematic study of ChatGPT-generated code, this paper assesses the reliability and quality of ChatGPT in code-generation tasks, identifies common quality problems, and offers suggestions for improving code quality. These findings provide a useful reference for future research and development.