Abstract: Large language models (LLMs) have significantly improved their ability to perform code generation tasks. However, there is still a gap between LLMs being capable coders and being top-tier software engineers. Based on the observation that top-level software engineers often ask clarifying questions to reduce ambiguity in both requirements and coding solutions, we argue that the same should apply to LLMs for code generation tasks. In this work, we conducted an empirical study that benchmarks and analyzes the communication skills of LLMs for code generation. We define the communication skills of LLMs as "being able to ask clarifying questions when the description of the code generation problem has issues". We created a new benchmark, HumanEvalComm, by modifying problem descriptions according to three issues: inconsistency, ambiguity, and incompleteness. We defined new evaluation metrics such as Communication Rate and Good Question Rate, and then experimented on HumanEvalComm with different Code LLMs and with a new LLM agent approach, Okanagan, which identifies ambiguous parts of the code and descriptions and asks questions about them in order to further refine the generated code. Finally, we discussed the evaluation results by comparing Code LLMs and Okanagan.
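The abstract names Communication Rate and Good Question Rate without giving formulas. As a rough illustration only, the sketch below assumes one plausible reading: Communication Rate is the fraction of responses that ask a clarifying question instead of returning code, and Good Question Rate is the fraction of responses whose question was labeled good. The `Response` record and its quality labels are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Response:
    """Hypothetical record for one model response to one modified problem."""
    asked_question: bool    # True if the model asked instead of emitting code
    question_quality: str   # assumed labels: "good", "bad", or "none"


def communication_rate(responses: List[Response]) -> float:
    """Fraction of responses that ask a clarifying question (assumed definition)."""
    if not responses:
        return 0.0
    return sum(r.asked_question for r in responses) / len(responses)


def good_question_rate(responses: List[Response]) -> float:
    """Fraction of responses whose question was labeled good (assumed definition)."""
    if not responses:
        return 0.0
    good = sum(r.asked_question and r.question_quality == "good" for r in responses)
    return good / len(responses)


# Made-up example data, not results from the paper.
responses = [
    Response(True, "good"),
    Response(True, "bad"),
    Response(False, "none"),
    Response(False, "none"),
]
print(communication_rate(responses), good_question_rate(responses))
```

Both rates here share the same denominator (all responses), which is a design choice of this sketch; the paper's exact definitions may differ.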

What problem does this paper attempt to address?

The problem that this paper attempts to solve is the insufficient communication ability of large language models (LLMs) in code-generation tasks. Although LLMs have made remarkable progress in code generation, there is still a gap between them and top-level software engineers. Specifically, when the input problem descriptions are incomplete, inconsistent or ambiguous, LLMs usually do not ask clarifying questions to obtain more information but directly generate code, which may lead to low-quality or incorrect code. The paper explores this problem by defining the communication ability of LLMs as "being able to ask clarifying questions when there are problems with the problem description", and proposes a new benchmark, HumanEvalComm, and an LLM-based agent, Okanagan, for evaluating and improving the communication ability of LLMs in code-generation tasks.

### Main contributions of the paper:

1. **Created a new benchmark**: HumanEvalComm, which introduces ambiguity, inconsistency and incompleteness by manually modifying the original problem descriptions to evaluate the communication skills of LLMs.
2. **Proposed an LLM-based agent**: Okanagan, which has a multi-round structure and customized prompts and is able to ask clarifying questions in code-generation tasks, thereby improving the quality of code generation.
3. **Conducted the first systematic empirical study**: Evaluated the communication abilities of different LLMs and Okanagan on HumanEvalComm, and introduced two new evaluation metrics, Communication Rate and Good Question Rate, to measure the communication skills of models.

### Main findings:

- When the problem descriptions are manually modified to be ambiguous, inconsistent or incomplete, more than 60% of LLMs still directly generate code instead of asking questions.
- Compared with the original LLMs, Okanagan significantly improves the communication rate and the good question rate by 59% and 5% respectively, thereby increasing the test pass rate and the Pass@1 metric by 25% and 15% respectively.

### Conclusion:

By asking clarifying questions, LLMs can obtain necessary information more effectively and thus generate higher-quality code. Okanagan, as an LLM-based agent method, shows potential for improving the communication ability of LLMs and provides a new direction for future code-generation tasks.
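The summary describes Okanagan only as having a multi-round structure and customized prompts. The following is a minimal sketch of what a generic ask-then-code loop could look like, not Okanagan's actual design: the prompt wording, the `looks_like_code` heuristic, and the `stub_model` and `stub_user` stand-ins are all assumptions made for illustration.

```python
from typing import Callable


def looks_like_code(reply: str) -> bool:
    # Crude heuristic (an assumption): a reply containing a function
    # definition is treated as code, anything else as questions.
    return "def " in reply


def clarify_then_code(
    description: str,
    model: Callable[[str], str],
    answer_questions: Callable[[str], str],
    max_rounds: int = 2,
) -> str:
    """Let the model ask questions for up to max_rounds before it must code."""
    prompt = (
        "Write Python code for the task below, or ask clarifying questions "
        "if the description is ambiguous, inconsistent or incomplete.\n\n"
        + description
    )
    reply = model(prompt)
    for _ in range(max_rounds):
        if looks_like_code(reply):
            return reply
        # The reply is a set of questions: collect answers and re-prompt.
        answers = answer_questions(reply)
        prompt = (
            f"{prompt}\n\nQuestions:\n{reply}\n\nAnswers:\n{answers}\n\n"
            "Now write the code."
        )
        reply = model(prompt)
    return reply


def stub_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call: asks once, then writes code.
    if "Answers:" in prompt:
        return "def sort_list(xs):\n    return sorted(xs)"
    return "Should the list be sorted in ascending or descending order?"


def stub_user(questions: str) -> str:
    # Hypothetical stand-in for whoever answers the clarifying questions.
    return "Ascending order."


result = clarify_then_code("Sort the list.", stub_model, stub_user)
print(result)
```

In a real setting the two stubs would be replaced by an actual model call and by a source of answers, such as a human or the original unmodified problem description.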

Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agent
