Abstract: There has been a recent explosion of research on Large Language Models (LLMs) for software engineering tasks, in particular code generation. However, results from LLMs can be highly unstable; nondeterministically returning very different codes for the same prompt. Non-determinism is a potential menace to scientific conclusion validity. When non-determinism is high, scientific conclusions simply cannot be relied upon unless researchers change their behaviour to control for it in their empirical analyses. This paper conducts an empirical study to demonstrate that non-determinism is, indeed, high, thereby underlining the need for this behavioural change. We choose to study ChatGPT because it is already highly prevalent in the code generation research literature. We report results from a study of 829 code generation problems from three code generation benchmarks (i.e., CodeContests, APPS, and HumanEval). Our results reveal high degrees of non-determinism: the ratio of coding tasks with zero equal test output across different requests is 75.76%, 51.00%, and 47.56% for CodeContests, APPS, and HumanEval, respectively. In addition, we find that setting the temperature to 0 does not guarantee determinism in code generation, although it indeed brings less non-determinism than the default configuration (temperature=1). These results confirm that there is, currently, a significant threat to scientific conclusion validity. In order to put LLM-based research on firmer scientific foundations, researchers need to take into account non-determinism in drawing their conclusions.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is the non-determinism of large language models (LLMs) in code-generation tasks. Specifically, the paper focuses on the non-deterministic behavior of ChatGPT during code generation: for the same prompt, ChatGPT may generate very different code. This non-determinism affects the correctness and consistency of the generated code, weakens developers' trust in LLMs, and leads to low reproducibility of LLM-based research papers.

### Main Research Questions

1. **How non-deterministic is ChatGPT when generating code under default settings?**
   - Evaluate the severity of non-determinism by comparing the semantic, syntactic, and structural similarities of the code generated across multiple requests.
2. **What is the influence of the temperature parameter on non-determinism?**
   - Temperature is a hyperparameter that controls the randomness of the generation results. Study the changes in non-determinism under different temperature values.
3. **How similar are the code candidates generated in different predictions to those within the same prediction?**
   - Compare the similarities between the code candidates generated in different predictions and those within the same prediction.
4. **Which types of programming tasks have higher non-determinism?**
   - Study the correlation between the characteristics of programming tasks (such as description length and difficulty) and non-determinism.
5. **How does the non-determinism of GPT-4 compare to that of GPT-3.5?**
   - Compare the degrees of non-determinism of GPT-3.5 and GPT-4 in code generation.
6. **How do different prompt engineering strategies affect non-determinism?**
   - Study the effects of different prompt strategies (such as chain-of-thought and concise code requests) on non-determinism.
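To make the role of temperature in research question 2 concrete, here is a minimal sketch of the textbook temperature-scaled softmax. It illustrates the standard definition only; ChatGPT's actual decoding is not public, and the function name `softmax_with_temperature` and the example logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Assumption: treat temperature 0 as greedy decoding, putting all
        # probability on the largest logit (the first one on ties).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical next-token scores
p_default = softmax_with_temperature(logits, 1.0)  # default-like setting
p_low = softmax_with_temperature(logits, 0.2)      # sharper, less random
p_zero = softmax_with_temperature(logits, 0.0)     # greedy limit
```

In this idealized model, temperature 0 always selects the same token. The study's finding that ChatGPT still varies at temperature 0 therefore suggests that something other than this sampling step contributes to the variation; this summary does not say what.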
### Research Background and Significance

In recent years, large language models have made significant progress in software engineering tasks, especially code generation. However, their output is highly unstable: different requests with the same prompt may return very different code. This non-determinism not only affects the quality of the generated code but also reduces developers' trust in these models and makes LLM-based research difficult to reproduce. Understanding and controlling for it is therefore key to the reliability and consistency of LLMs in practical applications.

### Experimental Design

The researchers selected three widely used code-generation benchmark datasets (CodeContests, APPS, and HumanEval) and evaluated the non-determinism of ChatGPT under different settings through a series of experiments:

- **Code Generation**: Use the same prompt to have ChatGPT generate code five times.
- **Similarity Analysis**: Compare the generated code in terms of semantics, syntax, and structure.
- **Parameter Adjustment**: Study the influence of the temperature parameter on non-determinism.
- **Feature Analysis**: Explore the relationship between the characteristics of programming tasks and non-determinism.

### Conclusions

The study found that, under default settings, ChatGPT's non-determinism in code generation is severe: the ratio of coding tasks with zero equal test output across different requests is 75.76%, 51.00%, and 47.56% for CodeContests, APPS, and HumanEval, respectively. Setting the temperature to 0 reduces non-determinism but does not eliminate it. In addition, tasks with longer descriptions tend to yield code with lower similarity and more errors, and different prompt engineering strategies also affect the degree of non-determinism. Overall, the paper reveals the threat that non-determinism poses to LLM code generation research and calls on researchers to account for it in future work so that their conclusions are reliable and scientifically valid.
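The generate-five-times and similarity-analysis steps of the experimental design can be illustrated with a small sketch. It assumes each task's generations have already been run against the task's tests and their outputs recorded. The function names, the data layout, and the toy data are hypothetical, and the metric follows one reading of "zero equal test output" (no two requests produce identical test outputs); the paper's exact definitions may differ. Python's `difflib` is swapped in as a crude stand-in for a syntactic similarity measure.

```python
from difflib import SequenceMatcher
from itertools import combinations


def equal_output_pairs(outputs_per_request):
    """Count pairs of requests whose recorded test outputs are identical."""
    return sum(a == b for a, b in combinations(outputs_per_request, 2))


def zero_equal_ratio(tasks):
    """Fraction of tasks in which no two requests produced the same test outputs."""
    if not tasks:
        return 0.0
    zero = sum(1 for outs in tasks.values() if equal_output_pairs(outs) == 0)
    return zero / len(tasks)


def mean_pairwise_text_similarity(codes):
    """Average difflib ratio over all pairs of code strings (stand-in metric)."""
    pairs = list(combinations(codes, 2))
    if not pairs:
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


# Toy data: each task maps to the test outputs of five generations.
tasks = {
    "task_a": [("1", "2"), ("1", "2"), ("1", "3"), ("0", "2"), ("1", "2")],
    "task_b": [("x",), ("y",), ("z",), ("w",), ("v",)],
}

ratio = zero_equal_ratio(tasks)  # task_b has no agreeing pair, task_a has some
```

Comparing recorded test outputs rather than source text captures semantic agreement: two generations can differ textually and still behave identically on the tests, which is why the sketch keeps the two measures separate.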

An Empirical Study of the Non-determinism of ChatGPT in Code Generation

No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT

Where Are Large Language Models for Code Generation on GitHub?

Optimizing Large Language Model Hyperparameters for Code Generation

Examination of Code generated by Large Language Models

Can ChatGPT replace StackOverflow? A Study on Robustness and Reliability of Large Language Model Code Generation

A Closer Look at Different Difficulty Levels Code Generation Abilities of ChatGPT.

ChatGPT Code Detection: Techniques for Uncovering the Source of Code

Occasionally Secure: A Comparative Analysis of Code Generation Assistants

On the Effectiveness of Large Language Models in Domain-Specific Code Generation

ChatGPT-Generated Code Assignment Detection Using Perplexity of Large Language Models (Student Abstract)

Beyond Code Generation: An Observational Study of ChatGPT Usage in Software Engineering Practice

Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code?

The GPT Surprise: Offering Large Language Model Chat in a Massive Coding Class Reduced Engagement but Increased Adopters Exam Performances

Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation

Experimenting with ChatGPT for Spreadsheet Formula Generation: Evidence of Risk in AI Generated Spreadsheets

Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity

Analyzing Large language models chatbots: An experimental approach using a probability test

Judgments of research co-created by generative AI: experimental evidence