Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis

Minda Li, Bhaskar Krishnamachari

2024-11-12

Abstract: ChatGPT and other large language models (LLMs) promise to revolutionize software development by automatically generating code from program specifications. We assess the performance of ChatGPT's GPT-3.5-turbo model on LeetCode, a popular platform with algorithmic coding challenges for technical interview practice, across three difficulty levels: easy, medium, and hard. We test three main hypotheses. First, ChatGPT solves fewer problems as difficulty rises (Hypothesis 1). Second, prompt engineering improves ChatGPT's performance, with greater gains on easier problems and diminishing returns on harder ones (Hypothesis 2). Third, ChatGPT performs better in popular languages like Python, Java, and C++ than in less common ones like Elixir, Erlang, and Racket (Hypothesis 3). To investigate these hypotheses, we conduct automated experiments using Python scripts to generate prompts that instruct ChatGPT to create Python solutions. These solutions are stored and manually submitted on LeetCode to check their correctness. For Hypothesis 1, results show the GPT-3.5-turbo model successfully solves 92% of easy, 79% of medium, and 51% of hard problems. For Hypothesis 2, prompt engineering yields improvements: 14-29% for Chain of Thought Prompting, 38-60% by providing failed test cases in a second feedback prompt, and 33-58% by switching to GPT-4. From a random subset of problems ChatGPT solved in Python, it also solved 78% in Java, 50% in C++, and none in Elixir, Erlang, or Racket. These findings generally validate all three hypotheses.
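The abstract describes Python scripts that generate prompts and store the returned solutions for later manual submission. A minimal sketch of that kind of pipeline might look like the following. The prompt wording, the problem record fields (`slug`, `title`, `difficulty`, `description`), the file naming scheme, and the `ask_model` callable are all assumptions made for illustration; they are not taken from the paper, and `ask_model` stands in for whatever client actually queries the model.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def build_prompt(title: str, description: str, language: str = "Python") -> str:
    """Turn a problem statement into a code-generation prompt.

    The wording here is hypothetical, not the prompt used in the paper.
    """
    return (
        f"Solve the following LeetCode problem in {language}.\n"
        f"Problem: {title}\n\n"
        f"{description}\n\n"
        "Return only the code."
    )


def generate_and_store(
    problems: Iterable[Dict[str, str]],
    ask_model: Callable[[str], str],
    out_dir: str,
) -> List[Path]:
    """Query the model once per problem and save each answer to its own file.

    `ask_model` is a placeholder: any function that maps a prompt string
    to the model's reply. The saved files can then be submitted by hand.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for problem in problems:
        prompt = build_prompt(problem["title"], problem["description"])
        solution = ask_model(prompt)
        # Hypothetical naming scheme: difficulty first, so files group by level.
        path = out / f"{problem['difficulty']}_{problem['slug']}.py"
        path.write_text(solution, encoding="utf-8")
        saved.append(path)
    return saved
```

Passing the model client in as a plain callable keeps the sketch independent of any particular API and makes it easy to dry-run with a stub function before spending real requests.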

Software Engineering, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses three main questions:

1. **How effectively does ChatGPT solve LeetCode programming problems at different difficulty levels?**
   - The researchers evaluated how well ChatGPT's GPT-3.5-turbo model solves LeetCode problems at three difficulty levels (easy, medium, hard).
   - **Hypothesis 1**: As problem difficulty increases, ChatGPT solves fewer problems.
   - Results: GPT-3.5-turbo solved 92% of easy problems, 79% of medium problems, and 51% of hard problems.
2. **Can prompt engineering improve ChatGPT's programming performance?**
   - The researchers tried three methods to improve performance:
     - **Chain-of-Thought prompting**: Have ChatGPT generate pseudocode first and then write the actual program.
     - **Providing failed test cases**: Feed the failed test cases back to ChatGPT in a second prompt so it can repair its solution.
     - **Switching to a more advanced model (GPT-4)**: Compare performance against a stronger model.
   - **Hypothesis 2**: Prompt engineering improves ChatGPT's performance, with the largest gains on easy problems and diminishing returns on harder ones.
   - Results: Chain-of-Thought prompting yielded improvements of 14-29%, providing failed test cases yielded 38-60%, and switching to GPT-4 yielded 33-58%.
3. **How does ChatGPT perform in different programming languages?**
   - The researchers evaluated five additional languages (Java, C++, Elixir, Erlang, Racket), using Python as the baseline.
   - **Hypothesis 3**: ChatGPT performs better in mainstream languages (such as Python, Java, C++) than in less common ones (such as Elixir, Erlang, Racket).
   - Results: From a random subset of problems that ChatGPT had solved in Python, it also solved 78% in Java and 50% in C++, but none in Elixir, Erlang, or Racket.

### Summary

Through empirical analysis, this paper evaluates ChatGPT's performance on programming problems of different difficulty levels and examines how prompt engineering and the choice of programming language affect that performance. The results generally support all three hypotheses and show that ChatGPT is strong on simple tasks and limited on complex ones.
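The two prompt-based techniques summarized above (pseudocode-first Chain-of-Thought prompting and a second prompt carrying failed test cases) could be assembled roughly as in the sketch below. The exact wording, the `FailedCase` record, and both function names are hypothetical; the paper's actual prompts are not reproduced here.

```python
from dataclasses import dataclass
from typing import List


def chain_of_thought_prompt(description: str) -> str:
    """Ask for pseudocode before the final program (wording is an assumption)."""
    return (
        "First write pseudocode that outlines your approach step by step, "
        "then write the final Python program.\n\n"
        f"Problem:\n{description}"
    )


@dataclass
class FailedCase:
    """One failing test as reported after a submission (hypothetical record)."""
    test_input: str
    expected: str
    actual: str


def feedback_prompt(previous_code: str, failures: List[FailedCase]) -> str:
    """Build a follow-up prompt that shows the model which tests it failed."""
    lines = ["Your previous solution failed the following test cases:"]
    for i, case in enumerate(failures, start=1):
        lines.append(
            f"{i}. input: {case.test_input} | expected: {case.expected} | got: {case.actual}"
        )
    lines += [
        "",
        "Previous solution:",
        previous_code,
        "",
        "Please fix the solution and return only the corrected Python code.",
    ]
    return "\n".join(lines)
```

Including the previous solution alongside the failing cases is a design choice made here so that the follow-up prompt is self-contained even if the conversation history is not resent.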
