Abstract: Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to various aspects of software development, including the automated generation of unit tests, but existing approaches require additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation of the effectiveness of LLMs for automated unit test generation without requiring additional training or manual effort. Concretely, we consider an approach where the LLM is provided with prompts that include the signature and implementation of a function under test, along with usage examples extracted from documentation. Furthermore, if a generated test fails, our approach attempts to generate a new test that fixes the problem by re-prompting the model with the failing test and error message. We implement our approach in <sc>TestPilot</sc>, an adaptive LLM-based test generation tool for JavaScript that automatically generates unit tests for the methods in a given project's API. We evaluate <sc>TestPilot</sc> using OpenAI's <italic>gpt3.5-turbo</italic> LLM on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%. In contrast, the state-of-the-art feedback-directed JavaScript test generation technique, Nessie, achieves only 51.3% statement coverage and 25.6% branch coverage. Furthermore, experiments that exclude parts of the information included in the prompts show that all components contribute towards the generation of effective test suites.
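The prompt-and-refine loop described above can be illustrated with a minimal sketch. It is not TestPilot's implementation: the prompt layout, the function names `build_prompt` and `generate_test`, the attempt limit, and the two caller-supplied callables (`complete`, which sends a prompt to some LLM, and `run_test`, which executes a test and returns an error message or `None`) are all assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple


def build_prompt(
    signature: str,
    implementation: str,
    usage_examples: Sequence[str],
    failing_test: Optional[str] = None,
    error_message: Optional[str] = None,
) -> str:
    """Assemble a plain-text prompt from the pieces the abstract lists.

    The section labels are hypothetical; the abstract does not specify a layout.
    """
    sections = [
        "Function signature:\n" + signature,
        "Implementation:\n" + implementation,
    ]
    if usage_examples:
        sections.append("Usage examples:\n" + "\n".join(usage_examples))
    if failing_test is not None:
        # Refinement step: show the model its previous test and why it failed.
        sections.append("Previous failing test:\n" + failing_test)
        sections.append("Error message:\n" + (error_message or ""))
    sections.append("Write a unit test for this function.")
    return "\n\n".join(sections)


def generate_test(
    complete: Callable[[str], str],
    run_test: Callable[[str], Optional[str]],
    signature: str,
    implementation: str,
    usage_examples: Sequence[str],
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Return (passing test or None, number of attempts made)."""
    failing_test: Optional[str] = None
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        prompt = build_prompt(
            signature, implementation, usage_examples, failing_test, error
        )
        test = complete(prompt)
        error = run_test(test)  # None means the test passed
        if error is None:
            return test, attempt
        failing_test = test  # feed the failure into the next prompt
    return None, max_attempts
```

Passing the model and the test runner in as callables keeps the loop independent of any particular LLM API or JavaScript test framework, which is also what makes it easy to exercise with stubs.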
We also find that 92.8% of <sc>TestPilot</sc>'s generated tests have ≤ 50% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies. Finally, we run <sc>TestPilot</sc> with two additional LLMs, OpenAI's older <italic>code-cushman-002</italic> LLM and <italic>StarCoder</italic>, an LLM for which the training process is publicly documented. Overall, we observed similar results with the former (68.2% median statement coverage), and somewhat worse results with the latter (54.0% median statement coverage), suggesting that the effectiveness of the approach is influenced by the size and training set of the LLM, but does not fundamentally depend on the specific model.
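To make the similarity measure concrete, the following sketch computes a similarity score from normalized edit distance. The abstract does not give the exact formula, so the definition used here (one minus the Levenshtein distance divided by the length of the longer string, compared character by character) is an assumption, as are the function names.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed definition: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - edit_distance(a, b) / longest


def max_similarity(generated: str, existing: Iterable[str]) -> float:
    """Highest similarity between one generated test and any existing test."""
    return max((similarity(generated, t) for t in existing), default=0.0)
```

Under this definition, a generated test counts as having ≤ 50% similarity when `max_similarity` over the project's existing tests is at most 0.5, and an exact copy scores 1.0.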

The JavaScript Package Selection Task: A Comparative Experiment Using an LLM-based Approach

Retrieving and Ranking Relevant JavaScript Technologies from Web Repositories

RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation

Let Me Do It For You: Towards LLM Empowered Recommendation via Tool Learning

LLMRS: Unlocking Potentials of LLM-Based Recommender Systems for Software Purchase

Beyond Utility: Evaluating LLM as Recommender

Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation

JobRecoGPT -- Explainable job recommendations using LLMs

A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models

Comparing the Utility, Preference, and Performance of Course Material Search Functionality and Retrieval-Augmented Generation Large Language Model (RAG-LLM) AI Chatbots in Information-Seeking Tasks

LLaRA: Large Language-Recommendation Assistant

Web Application for Retrieval-Augmented Generation: Implementation and Testing

Adopting RAG for LLM-Aided Future Vehicle Design

Pistis-RAG: A Scalable Cascading Framework Towards Trustworthy Retrieval-Augmented Generation

LLaRA: Aligning Large Language Models with Sequential Recommenders

An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation

Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

LLMRec: Benchmarking Large Language Models on Recommendation Task

AssistRAG: Boosting the Potential of Large Language Models with an Intelligent Information Assistant

Enhancing Recommendation Diversity by Re-ranking with Large Language Models