Abstract:Code generation aims to generate source code implementing human requirements illustrated with natural language specifications. With the rapid development of intelligent software engineering, automated code generation has become a hot research topic in both artificial intelligence and software engineering, and researchers have made significant achievements on code generation. More recently, large language models (LLMs) have demonstrated outstanding performance on code generation tasks, such as ChatGPT released by OpenAI presents the fantastic potential on automated code generation. However, the existing studies are limited to exploring LLMs' ability for generating code snippets to solve simple programming problems, the task of competition-level code generation has never been investigated. The specifications of the programming competition are always complicated and require the specific input/output format as well as the high-level algorithmic reasoning ability. In this study, we conduct the first large empirical study to investigate the zero-shot learning ability of ChatGPT for solving competition programming problems. Specifically, we warm up the design of prompts by using the Human-Eval dataset. Then, we apply the well-designed prompt to the competition-level code generation dataset, namely APPS, to further explore the effectiveness of using ChatGPT for solving competition problems. We collect ChatGPT's outputs on 5,000 code competition problems, the evaluation results show that it can successfully pass 25.4% test cases. By further feeding extra information (e.g, test failed information) to ChatGPT, we observe that ChatGPT has the potential to fix partial pass into a fully pass program. Moreover, we investigate the solutions generated by LLMs and the existing solutions, we find that it prefers to directly copy the code instead of re-write when facing more difficult problems. Finally, we evaluate the code quality generated by ChatGPT in terms of “code cleanness”, we observe that the generated codes are with small functions and file sizes, which are in line with the standard of clean code.

Benchmarking ChatGPT on Algorithmic Reasoning

ChatGPT for Programming Numerical Methods

Extending the Frontier of ChatGPT: Code Generation and Debugging

Unmasking the giant: A comprehensive evaluation of ChatGPT's proficiency in coding algorithms and data structures

Is ChatGPT a General-Purpose Natural Language Processing Task Solver?

A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets

ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

ChatGPT-Crawler: Find out if ChatGPT really knows what it's talking about

Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis

Effectiveness of ChatGPT in Coding: A Comparative Analysis of Popular Large Language Models

Bayesian artificial brain with ChatGPT

Evaluating AI-generated code for C++, Fortran, Go, Java, Julia, Matlab, Python, R, and Rust

ChatGPT Participates in a Computer Science Exam

A Survey on the Real Power of ChatGPT

In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT

Investigating the Effectiveness of ChatGPT in Mathematical Reasoning and Problem Solving: Evidence from the Vietnamese National High School Graduation Examination

Is ChatGPT the Ultimate Programming Assistant -- How far is it?

A Closer Look at Different Difficulty Levels Code Generation Abilities of ChatGPT.

ChatGPT as Research Scientist: Probing GPT's Capabilities as a Research Librarian, Research Ethicist, Data Generator and Data Predictor

Kattis vs. ChatGPT: Assessment and Evaluation of Programming Tasks in the Age of Artificial Intelligence