Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agent

Jie JW Wu, Fatemeh H. Fard
2024-06-01
Abstract: Large language models (LLMs) have significantly improved their ability to perform tasks in the field of code generation. However, there is still a gap between LLMs being capable coders and being top-tier software engineers. Based on the observation that top-level software engineers often ask clarifying questions to reduce ambiguity in both requirements and coding solutions, we argue that the same should be applied to LLMs for code generation tasks. In this work, we conducted an empirical study on the benchmark and analysis of the communication skills of LLMs for code generation. We define communication skills of LLMs as "being able to ask clarifying questions when the description of the code generation problem has issues". We created a new benchmark, HumanEvalComm, by modifying problem descriptions according to three issues: inconsistency, ambiguity, incompleteness. We defined new evaluation metrics such as Communication Rate and Good Question Rate, and then experimented on HumanEvalComm with different Code LLMs, and a new LLM agent approach, Okanagan, to identify and ask questions in ambiguous parts from code and descriptions for further refining the generated code. Finally, we discussed evaluation results by comparing Code LLMs and Okanagan with our findings.
Software Engineering
What problem does this paper attempt to address?
This paper addresses the insufficient communication ability of large language models (LLMs) in code generation tasks. Although LLMs have made remarkable progress in code generation, there is still a gap between them and top-level software engineers. Specifically, when the input problem description is incomplete, inconsistent, or ambiguous, LLMs usually do not ask clarifying questions to obtain more information but generate code directly, which may lead to low-quality or incorrect code. The paper defines the communication ability of LLMs as "being able to ask clarifying questions when there are problems with the problem description", and proposes a new benchmark, HumanEvalComm, and an LLM-based agent, Okanagan, for evaluating and improving the communication ability of LLMs in code generation tasks.

### Main contributions of the paper:

1. **Created a new benchmark**: HumanEvalComm, which introduces ambiguity, inconsistency, and incompleteness by manually modifying the original problem descriptions in order to evaluate the communication skills of LLMs.
2. **Proposed an LLM-based agent**: Okanagan, which has a multi-round structure and customized prompts and is able to ask clarifying questions in code generation tasks, thereby improving the quality of the generated code.
3. **Conducted the first systematic empirical study**: Evaluated the communication abilities of different LLMs and Okanagan on HumanEvalComm, and introduced two new evaluation metrics, Communication Rate and Good Question Rate, to measure the communication skills of models.

### Main findings:

- When the problem descriptions are manually modified to be ambiguous, inconsistent or incomplete, more than 60% of LLMs still directly generate code instead of asking questions.
- Compared with the original LLMs, Okanagan improves the Communication Rate and the Good Question Rate by 59% and 5% respectively, thereby increasing the test pass rate and the Pass@1 metric by 25% and 15% respectively.

### Conclusion:

By asking clarifying questions, LLMs can obtain necessary information more effectively and thus generate higher-quality code. Okanagan, as an LLM-based agent method, shows potential for improving the communication ability of LLMs and provides a new direction for future code generation tasks.
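
To make the two metrics named above more concrete, the following is a minimal sketch of how they might be computed over a set of model responses. The summary does not give the paper's exact formulas, so the definitions here are assumptions: Communication Rate is taken as the fraction of responses that ask a clarifying question instead of returning code, and Good Question Rate as the fraction of all responses whose question was judged good. The `Response` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One model response to a modified problem description (hypothetical record)."""
    asked_question: bool            # True if the model asked a question instead of returning code
    question_is_good: bool = False  # judged quality; only meaningful when asked_question is True


def communication_rate(responses):
    """Assumed definition: share of responses that ask a clarifying question."""
    if not responses:
        return 0.0
    return sum(r.asked_question for r in responses) / len(responses)


def good_question_rate(responses):
    """Assumed definition: share of all responses whose question was judged good."""
    if not responses:
        return 0.0
    good = sum(r.asked_question and r.question_is_good for r in responses)
    return good / len(responses)


# Illustrative data only, not results from the paper.
responses = [
    Response(asked_question=True, question_is_good=True),
    Response(asked_question=True, question_is_good=False),
    Response(asked_question=False),
    Response(asked_question=False),
]
print(communication_rate(responses))
print(good_question_rate(responses))
```

Under these assumed definitions, the Good Question Rate can never exceed the Communication Rate, because only responses that ask a question can be counted as asking a good one.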