Abstract: While code generation has been widely used in various software development scenarios, the quality of the generated code is not guaranteed. This has been a particular concern in the era of large language model (LLM)-based code generation, where LLMs, deemed complex and powerful black-box models, are instructed by a high-level natural language specification, namely a prompt, to generate code. Nevertheless, effectively evaluating and explaining the code generation capability of LLMs is inherently challenging, given the complexity of LLMs and their lack of transparency. Inspired by the recent progress in causality analysis and its application in software engineering, this paper presents a causality analysis-based approach to systematically analyze the causal relations between the LLM input prompts and the generated code. To handle various technical challenges in this study, we first propose a novel causal graph-based representation of the prompt and the generated code, which is established over the fine-grained, human-understandable concepts in the input prompts. The resulting causal graph is then used to identify the causal relations between the prompt and the derived code. We illustrate the insights that our framework can provide by studying over 3 popular LLMs with over 12 prompt adjustment strategies. The results of these studies illustrate the potential of our technique to provide insights into LLM effectiveness and to aid end-users in understanding predictions. Additionally, we demonstrate that our approach provides actionable insights to improve the quality of the LLM-generated code by properly calibrating the prompt.

What problem does this paper attempt to address?

This paper addresses how to effectively evaluate and interpret the code-generation ability of large language models (LLMs). Specifically, it focuses on three issues:

1. **Fluctuations in code-generation quality**: When code is generated from natural-language specifications (i.e., prompts), different phrasings of the same intent can lead to significantly different code outputs. Such fluctuations make LLM-based code generation opaque and difficult to predict, hindering its wide adoption in practice.
2. **Lack of systematic evaluation and interpretation methods**: The research community has proposed benchmarks for evaluating the code-generation ability of LLMs, such as CodeSearchNet and HumanEval, but these mainly focus on surface-level metrics (e.g., BLEU scores) or functional metrics (e.g., pass rates). They fail to capture the interaction between prompts and generated code, which is the key to understanding LLM-based code-generation behavior.
3. **Absence of causal-relationship analysis**: According to the paper, no existing work systematically analyzes the impact of prompts on the code generated by LLMs. The paper therefore introduces a causal-analysis method that aims to establish the causal relationship between prompts and generated code and to provide a systematic, easy-to-understand explanation framework.

To solve these problems, the paper proposes the following methods:

- **Causal-graph representation**: A new causal-graph representation is proposed for prompts and generated code. The graphs are built over fine-grained, human-understandable concepts in the input prompt.
- **Causal-relationship identification**: Advanced causal-analysis algorithms (such as DiBS) are used to identify the causal relationships between these features and to estimate the average treatment effect (ATE) of each rephrasing instruction on the generated code.
- **Systematic adjustment of prompts**: Diverse prompts are generated through rephrasing techniques to systematically explore the prompt space and optimize the quality of code generation.

Through these methods, the paper both offers an in-depth understanding and interpretation of the code-generation ability of LLMs and gives developers specific guidance on adjusting prompts to generate high-quality code.
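To make the notion of an average treatment effect (ATE) concrete, the following is a minimal sketch of a naive difference-in-means estimate. It is not the paper's pipeline, which learns a causal graph (e.g., with DiBS) before estimating effects. The sketch assumes that the rephrasing instruction is applied at random, so no confounder adjustment is needed. The function name `average_treatment_effect` and the pass/fail data are hypothetical and purely illustrative.

```python
from statistics import mean


def average_treatment_effect(outcomes_treated, outcomes_control):
    """Naive ATE estimate: mean outcome with treatment minus mean outcome without.

    Assumption (not from the paper): treatment assignment is random,
    so a plain difference in means is an unbiased estimate.
    """
    if not outcomes_treated or not outcomes_control:
        raise ValueError("both groups need at least one observation")
    return mean(outcomes_treated) - mean(outcomes_control)


# Hypothetical outcomes: 1 = generated code passed its tests, 0 = it failed.
original_prompts = [1, 0, 0, 1, 0, 1, 0, 0]   # control: prompts left unchanged
rephrased_prompts = [1, 1, 0, 1, 1, 1, 0, 1]  # treated: one rephrasing instruction applied

ate = average_treatment_effect(rephrased_prompts, original_prompts)
print(f"Estimated ATE on pass rate: {ate:+.3f}")
```

A positive value would suggest that the rephrasing instruction raises the pass rate on average. With observational data, where some prompt features influence both the choice of rephrasing and the outcome, an estimator that adjusts for those features along the learned causal graph would be needed instead.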

Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach

Benchmarking Causal Study to Interpret Large Language Models for Source Code

An Empirical Study of Code Generation Errors made by Large Language Models

Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code?

The First Prompt Counts the Most! An Evaluation of Large Language Models on Iterative Example-based Code Generation

Evaluating Large Language Models in Class-Level Code Generation

Test-Case-Driven Programming Understanding in Large Language Models for Better Code Generation

What Makes Large Language Models Reason in (Multi-Turn) Code Generation?

Exploring Multi-Lingual Bias of Large Code Models in Code Generation

A Survey on Evaluating Large Language Models in Code Generation Tasks

A Survey on Large Language Models for Code Generation

XPrompt: Explaining Large Language Model's Generation via Joint Prompt Attribution

Self-planning Code Generation with Large Language Models

A Deep Dive into Large Language Model Code Generation Mistakes: What and Why?

The Behavior of Large Language Models When Prompted to Generate Code Explanations

CodeJudge: Evaluating Code Generation with Large Language Models

What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

Improving Natural Language Capability of Code Large Language Model

Where Do Large Language Models Fail When Generating Code?