Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis

Minda Li, Bhaskar Krishnamachari
2024-11-12
Abstract: ChatGPT and other large language models (LLMs) promise to revolutionize software development by automatically generating code from program specifications. We assess the performance of ChatGPT's GPT-3.5-turbo model on LeetCode, a popular platform with algorithmic coding challenges for technical interview practice, across three difficulty levels: easy, medium, and hard. We test three main hypotheses. First, ChatGPT solves fewer problems as difficulty rises (Hypothesis 1). Second, prompt engineering improves ChatGPT's performance, with greater gains on easier problems and diminishing returns on harder ones (Hypothesis 2). Third, ChatGPT performs better in popular languages like Python, Java, and C++ than in less common ones like Elixir, Erlang, and Racket (Hypothesis 3). To investigate these hypotheses, we conduct automated experiments using Python scripts to generate prompts that instruct ChatGPT to create Python solutions. These solutions are stored and manually submitted on LeetCode to check their correctness. For Hypothesis 1, results show the GPT-3.5-turbo model successfully solves 92% of easy, 79% of medium, and 51% of hard problems. For Hypothesis 2, prompt engineering yields improvements: 14-29% for Chain of Thought Prompting, 38-60% by providing failed test cases in a second feedback prompt, and 33-58% by switching to GPT-4. From a random subset of problems ChatGPT solved in Python, it also solved 78% in Java, 50% in C++, and none in Elixir, Erlang, or Racket. These findings generally validate all three hypotheses.
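The workflow described in the abstract (a script builds a prompt, the model's reply is stored, and the stored file is later submitted to LeetCode by hand) could look roughly like the sketch below. The prompt wording, the `Problem` fields, the output directory layout, and the helper names `build_prompt`, `extract_code`, and `save_solution` are all assumptions for illustration; the paper's actual scripts are not reproduced here, and the model call itself is deliberately left out.

```python
from dataclasses import dataclass
from pathlib import Path
import re

# Markdown code fence, built indirectly so this listing stays readable.
FENCE = "`" * 3


@dataclass
class Problem:
    # Hypothetical record; the field names are assumptions, not the paper's schema.
    slug: str
    difficulty: str  # "easy", "medium", or "hard"
    statement: str
    starter_code: str


def build_prompt(problem: Problem, language: str = "Python") -> str:
    """Assemble a baseline prompt asking for a solution in `language`."""
    return (
        f"Solve the following LeetCode problem in {language}.\n\n"
        f"{problem.statement}\n\n"
        "Complete this starter code and return only the code:\n"
        f"{problem.starter_code}"
    )


def extract_code(reply: str) -> str:
    """Return the first fenced code block in a reply, or the whole reply if none."""
    match = re.search(FENCE + r"[^\n]*\n(.*?)" + FENCE, reply, flags=re.DOTALL)
    return (match.group(1) if match else reply).strip()


def save_solution(out_dir: Path, problem: Problem, reply: str) -> Path:
    """Store the extracted code so it can be submitted to LeetCode manually."""
    target = out_dir / problem.difficulty / f"{problem.slug}.py"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(extract_code(reply) + "\n", encoding="utf-8")
    return target
```

Grouping the stored files by difficulty is one convenient way to tally pass rates per level after the manual submissions; it is a design choice of this sketch, not something the abstract specifies.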
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses three main questions:

1. **How well does ChatGPT solve LeetCode programming problems at different difficulty levels?**
   - The researchers evaluated ChatGPT's GPT-3.5-turbo model on easy, medium, and hard problems from the LeetCode platform.
   - **Hypothesis 1**: ChatGPT solves fewer problems as difficulty rises.
   - Result: GPT-3.5-turbo solved 92% of easy problems, 79% of medium problems, and 51% of hard problems.
2. **Can prompt engineering improve ChatGPT's programming performance?**
   - The researchers tried three methods:
     - **Chain-of-Thought prompting**: ask ChatGPT to generate pseudocode first and then write the actual program.
     - **Providing failed test cases**: feed the failed test cases back to ChatGPT in a second prompt so it can repair the solution.
     - **Switching to a more advanced model (GPT-4)**: rerun the problems on a stronger model to compare performance.
   - **Hypothesis 2**: Prompt engineering improves ChatGPT's performance, with the largest gains on easy problems and diminishing returns on harder ones.
   - Result: Chain-of-Thought prompting improved results by 14-29%, with the 29% gain on easy problems; providing failed test cases gave larger gains (38-60%) on medium and hard problems; switching to GPT-4 also brought substantial gains (33-58%).
3. **How does ChatGPT perform in different programming languages?**
   - The researchers evaluated five additional languages (Java, C++, Elixir, Erlang, Racket), using Python as the baseline for comparison.
   - **Hypothesis 3**: ChatGPT performs better in mainstream languages (such as Python, Java, C++) than in less common ones (such as Elixir, Erlang, Racket).
   - Result: From a random subset of problems that ChatGPT had solved in Python, it also solved 78% in Java and 50% in C++, but none in Elixir, Erlang, or Racket.

### Summary

Through empirical analysis, this paper evaluates ChatGPT's performance on programming problems of different difficulty levels and examines how prompt engineering and the choice of programming language affect that performance. The results generally support all three hypotheses and show that ChatGPT is strong on easier tasks and limited on harder ones.
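The two prompt-level strategies described for Hypothesis 2, Chain-of-Thought prompting and a second feedback prompt carrying failed test cases, can be sketched as plain string builders. The wording of both prompts, the `FailedCase` fields, and the function names are assumptions made for illustration; the source does not reproduce the prompts that were actually used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailedCase:
    # Hypothetical shape of a failed test case copied from a LeetCode submission.
    test_input: str
    expected: str
    actual: str


def chain_of_thought_prompt(statement: str) -> str:
    """Ask for pseudocode first, then the actual Python program."""
    return (
        "Solve the following problem in two steps.\n"
        "Step 1: write pseudocode that outlines the algorithm.\n"
        "Step 2: translate the pseudocode into a Python solution.\n\n"
        f"Problem:\n{statement}"
    )


def feedback_prompt(previous_code: str, failures: List[FailedCase]) -> str:
    """Build the second prompt that feeds failed test cases back to the model."""
    lines = ["Your previous solution failed these test cases:"]
    for case in failures:
        lines.append(
            f"- input: {case.test_input}; expected: {case.expected}; got: {case.actual}"
        )
    lines.append("")
    lines.append("Previous solution:")
    lines.append(previous_code)
    lines.append("")
    lines.append("Please return a corrected Python solution.")
    return "\n".join(lines)
```

The third strategy, switching to GPT-4, needs no prompt change in this sketch: the same prompts would simply be sent to a different model.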