Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach

Zhenlan Ji, Pingchuan Ma, Zongjie Li, Shuai Wang
2023-10-10
Abstract: While code generation has been widely used in various software development scenarios, the quality of the generated code is not guaranteed. This has been a particular concern in the era of large language model (LLM)-based code generation, where an LLM, deemed a complex and powerful black-box model, is instructed by a high-level natural language specification, namely a prompt, to generate code. Nevertheless, effectively evaluating and explaining the code generation capability of LLMs is inherently challenging, given the complexity of LLMs and the lack of transparency. Inspired by the recent progress in causality analysis and its application in software engineering, this paper launches a causality analysis-based approach to systematically analyze the causal relations between the LLM input prompts and the generated code. To handle various technical challenges in this study, we first propose a novel causal graph-based representation of the prompt and the generated code, which is established over the fine-grained, human-understandable concepts in the input prompts. The formed causal graph is then used to identify the causal relations between the prompt and the derived code. We illustrate the insights that our framework can provide by studying over 3 popular LLMs with over 12 prompt adjustment strategies. The results of these studies illustrate the potential of our technique to provide insights into LLM effectiveness and to aid end-users in understanding predictions. Additionally, we demonstrate that our approach provides actionable insights to improve the quality of the LLM-generated code by properly calibrating the prompt.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks how to effectively evaluate and interpret the code-generation ability of large language models (LLMs). Specifically, it focuses on three issues:

1. **Fluctuations in code-generation quality**: When code is generated from natural-language specifications (i.e., prompts), different phrasings of the same intent can lead to significantly different code outputs. Such fluctuations make LLM-based code generation opaque and hard to predict, which hinders its adoption in practice.
2. **Lack of systematic evaluation and interpretation methods**: The research community has proposed benchmarks for evaluating LLM code generation, such as CodeSearchNet and HumanEval, but these mainly report surface-level metrics (e.g., BLEU scores) or functional metrics (e.g., pass rates). They do not capture the interaction between prompts and generated code, which is key to understanding LLM-based code generation.
3. **Absence of causal-relationship analysis**: According to the paper, no existing work systematically analyzes the impact of prompts on LLM-generated code. The paper therefore introduces a causality-analysis approach that aims to establish the causal relations between prompts and generated code and to provide a systematic, easy-to-understand explanation framework.

To address these problems, the paper proposes the following:

- **Causal-graph representation**: A new causal-graph representation of prompts and generated code is proposed. The graphs are built over fine-grained, human-understandable concepts in the input prompt.
- **Causal-relationship identification**: Causal-analysis algorithms (such as DiBS) are used to identify the causal relationships between the features in the graph and to estimate the average treatment effect (ATE) of each rephrasing instruction on the generated code.
- **Systematic adjustment of prompts**: Diverse prompts are generated through rephrasing to systematically explore the prompt space and optimize code-generation quality.

Together, these methods offer both an interpretable account of LLM code-generation ability and concrete guidance that helps developers adjust prompts to obtain higher-quality code.
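
To make the notion of an average treatment effect (ATE) more concrete, the following is a minimal sketch of a stratified (backdoor-adjusted) ATE estimate for one binary rephrasing instruction. It is not the paper's estimator and does not use DiBS; it assumes the adjustment variable is already known. The field names (`add_examples`, `task_difficulty`, `pass_rate`), the function name `stratified_ate`, and the data values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def stratified_ate(records, treatment, outcome, confounder):
    """Estimate the ATE of a binary treatment by stratifying on one confounder.

    Within each stratum, the effect is the difference in mean outcome between
    treated and control records. Strata are weighted by their record count.
    Strata lacking either arm are skipped, since no contrast can be formed.
    """
    strata = defaultdict(lambda: {True: [], False: []})
    for record in records:
        arm = bool(record[treatment])
        strata[record[confounder]][arm].append(record[outcome])

    weighted_effect = 0.0
    total = 0
    for groups in strata.values():
        treated, control = groups[True], groups[False]
        if not treated or not control:
            continue  # no overlap in this stratum
        size = len(treated) + len(control)
        weighted_effect += size * (mean(treated) - mean(control))
        total += size

    if total == 0:
        raise ValueError("no stratum contains both treated and control records")
    return weighted_effect / total


# Hypothetical observations: was the prompt rephrased to include examples,
# how hard was the task, and what fraction of tests did the code pass?
records = [
    {"add_examples": True, "task_difficulty": "easy", "pass_rate": 0.9},
    {"add_examples": True, "task_difficulty": "easy", "pass_rate": 0.9},
    {"add_examples": False, "task_difficulty": "easy", "pass_rate": 0.8},
    {"add_examples": True, "task_difficulty": "hard", "pass_rate": 0.5},
    {"add_examples": False, "task_difficulty": "hard", "pass_rate": 0.3},
    {"add_examples": False, "task_difficulty": "hard", "pass_rate": 0.3},
]

ate = stratified_ate(records, "add_examples", "task_difficulty" and "pass_rate", "task_difficulty")
print(f"Estimated ATE of add_examples on pass_rate: {ate:.3f}")
```

On this toy data the per-stratum effects are roughly 0.1 (easy) and 0.2 (hard), so the weighted estimate is about 0.15. A naive difference in means over all records would give a different number because easy tasks are over-represented among the rephrased prompts, which is the kind of bias that adjusting along a causal graph is meant to remove.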