Prompting Code Interpreter to Write Better Unit Tests on Quixbugs Functions

Vincent Li, Nick Doiron

2023-10-01

Abstract: Unit testing is a commonly-used approach in software engineering to test the correctness and robustness of written code. Unit tests are tests designed to test small components of a codebase in isolation, such as an individual function or method. Although unit tests have historically been written by human programmers, recent advancements in AI, particularly LLMs, have shown corresponding advances in automatic unit test generation. In this study, we explore the effect of different prompts on the quality of unit tests generated by Code Interpreter, a GPT-4-based LLM, on Python functions provided by the Quixbugs dataset, and we focus on prompting due to the ease with which users can make use of our findings and observations. We find that the quality of the generated unit tests is not sensitive to changes in minor details in the prompts provided. However, we observe that Code Interpreter is often able to effectively identify and correct mistakes in code that it writes, suggesting that providing it runnable code to check the correctness of its outputs would be beneficial, even though we find that it is already often able to generate correctly-formatted unit tests. Our findings suggest that, when prompting models similar to Code Interpreter, it is important to include the basic information necessary to generate unit tests, but minor details are not as important.
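To make the notion of a unit test concrete, the following is a minimal sketch using Python's standard `unittest` module. The `gcd` function is a hypothetical stand-in for a small, self-contained function of the kind the abstract describes; it is not copied from the Quixbugs dataset, and the test cases are illustrative rather than output from Code Interpreter.

```python
import unittest


def gcd(a: int, b: int) -> int:
    """Greatest common divisor via Euclid's algorithm (illustrative stand-in)."""
    while b:
        a, b = b, a % b
    return a


class TestGcd(unittest.TestCase):
    """Each test exercises the function in isolation on one kind of input."""

    def test_coprime(self):
        self.assertEqual(gcd(17, 5), 1)

    def test_common_factor(self):
        self.assertEqual(gcd(12, 18), 6)

    def test_zero_operand(self):
        self.assertEqual(gcd(0, 7), 7)


def run_tests() -> unittest.TestResult:
    """Load and run the test case programmatically, returning the result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestGcd)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

Running the tests programmatically, as `run_tests` does, is one way a model with code execution could check that the tests it wrote are at least well formed and runnable.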

Software Engineering, Machine Learning

What problem does this paper attempt to address?

This paper asks how different prompting strategies affect the quality of unit tests generated by a large language model (LLM). Specifically, it studies Code Interpreter, a GPT-4-based model, generating unit tests for Python functions from the Quixbugs dataset. The prompt factors examined include the format of the code context, the number of example unit tests provided, the choice of target functions, and whether natural-language comments are included in the prompt. The experiments show that prompt details have some effect on the quality of the generated tests, but the effect is not significant, which indicates that Code Interpreter is robust to minor changes in the prompt. As long as the prompt supplies the basic information, such as the function signature and the necessary background description, the generated unit tests are of relatively high quality. Prompts that give the code context directly, omit additional natural-language comments, and include two output examples usually score somewhat better on the performance metrics, though the differences are small. In practice, users can therefore adapt the prompt to their own needs without laboring over minor details.
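The prompt factors described above can be sketched as a small prompt builder. This is an illustration under stated assumptions, not the prompts used in the paper: the function name `build_prompt`, its parameters, and all prompt wording are hypothetical, as are the sample function and example tests.

```python
from typing import List, Optional


def build_prompt(
    function_source: str,
    example_tests: List[str],
    num_examples: int = 2,
    comment: Optional[str] = None,
) -> str:
    """Assemble a test-generation prompt (hypothetical wording throughout).

    function_source: code context, given directly as source text.
    example_tests:   candidate example unit tests to show the model.
    num_examples:    how many of those examples to include.
    comment:         optional natural-language description of the function.
    """
    parts = ["Write unit tests for the following Python function."]
    if comment:
        parts.append(comment)
    parts.append("Function under test:\n" + function_source.strip())
    chosen = example_tests[:num_examples]
    if chosen:
        examples = "\n\n".join(test.strip() for test in chosen)
        parts.append("Example unit tests:\n" + examples)
    return "\n\n".join(parts)


# Hypothetical inputs, for illustration only.
SOURCE = """
def gcd(a, b):
    while b:
        a, b = b, a % b
    return a
"""

EXAMPLES = [
    "def test_coprime():\n    assert gcd(17, 5) == 1",
    "def test_common_factor():\n    assert gcd(12, 18) == 6",
    "def test_zero_operand():\n    assert gcd(0, 7) == 7",
]

if __name__ == "__main__":
    print(build_prompt(SOURCE, EXAMPLES, num_examples=2))
```

Varying `num_examples` or passing a `comment` yields the kind of prompt variants the study compares; the default here mirrors the configuration the summary reports as usually performing slightly better (direct code context, no extra comments, two examples).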

Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM

Promptly: Using Prompt Problems to Teach Learners How to Effectively Utilize AI Code Generators

Can Developers Prompt? A Controlled Experiment for Code Documentation Generation

Prompt Problems: A New Programming Exercise for the Generative AI Era

Exploring the Curious Case of Code Prompts

Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting

Evaluating and Improving ChatGPT for Unit Test Generation

Prompting Techniques for Secure Code Generation: A Systematic Investigation

Integrating Natural Language Prompting Tasks in Introductory Programming Courses

Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs

On the Evaluation of Large Language Models in Unit Test Generation

StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code

Enhancing Computer Programming Education with LLMs: A Study on Effective Prompt Engineering for Python Code Generation

Validating LLM-Generated Programs with Metamorphic Prompt Testing

How Beginning Programmers and Code LLMs (Mis)read Each Other

No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation

Prompting with Pseudo-Code Instructions

Improving ChatGPT Prompt for Code Generation

Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs