A Comprehensive Framework for Evaluating API-oriented Code Generation in Large Language Models

Yixi Wu, Pengfei He, Zehao Wang, Shaowei Wang, Yuan Tian, Tse-Hsun Chen
2024-09-26
Abstract: Large language models (LLMs) like GitHub Copilot and ChatGPT have emerged as powerful tools for code generation, significantly enhancing productivity and accelerating software development. However, existing benchmarks primarily focus on general code generation without considering API-oriented code generation, i.e., generating code that invokes APIs from specific libraries. Given the growing demand for API-oriented code generation, there is a pressing need for a systematic and automated approach to evaluate LLMs on API-oriented code generation. To address this gap, we propose AutoAPIEval, a lightweight and automated framework designed to evaluate the capabilities of LLMs in API-oriented code generation. Our framework works with any library that provides API documentation and focuses on two unit tasks: API recommendation and code example generation, along with four metrics to evaluate the generated APIs and code examples, such as the proportion of incorrect API recommendations for Task 1, and the proportions of code examples where no specific API is invoked and of uncompilable/unexecutable code examples for Task 2. In addition, we conducted a case study on three LLMs (ChatGPT, MagiCoder, and DeepSeek Coder) and Java Runtime Environment 8 to demonstrate the framework's effectiveness. Our findings reveal substantial variability in LLM performance across tasks, with ChatGPT adhering better to instructions, while sharing similar effectiveness in code example generation with its counterparts (i.e., MagiCoder and DeepSeek Coder). We also identify key factors associated with code quality, such as API popularity and model confidence, and build classifiers that achieve high accuracy in detecting incorrect API recommendations and erroneous code examples. Retrieval-augmented generation enhances the quality of code generated by LLMs, though its effectiveness varies across different LLMs.
Software Engineering, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to evaluate the quality of API-oriented code generation in large language models (LLMs). Existing benchmarks mainly focus on general code generation and rarely consider API-oriented code generation, that is, generating code that calls APIs from specific libraries. With the increasing demand for API-oriented code generation, there is an urgent need for a systematic and automated method to assess the performance of LLMs in this area. To this end, the authors propose AutoAPIEval, a lightweight automated framework for evaluating the capabilities of LLMs in API-oriented code generation.

### Specific problems solved by the paper:

1. **Lack of systematic evaluation methods**: Existing evaluation methods mainly focus on general code generation and overlook the special requirements of API-oriented code generation.
2. **Limitations of manual testing**: Existing research usually relies on manually created test cases, which limits its scalability and scope of application.
3. **Quality evaluation of API-oriented code generation**: A method is needed to evaluate the quality of generated API recommendations and code examples, including error rates, compilation failure rates, and execution failure rates.

### Solutions:

- **Propose the AutoAPIEval framework**: This framework can automatically and systematically evaluate the performance of LLMs in API-oriented code generation.
- **Design two unit tasks**:
  - **API recommendation**: Given a library, query the LLM to recommend a list of APIs in each class.
  - **Code example generation**: Given an API, query the LLM to generate the corresponding code example.
- **Define four evaluation metrics**:
  - **Proportion of incorrect API recommendations**: The proportion of recommended APIs that do not exist in the specified library.
  - **Proportion of code examples that do not call a specific API**: The proportion of generated code examples that do not call the specified API.
  - **Proportion of code examples that cannot be compiled**: The proportion of generated code examples that cannot be compiled.
  - **Proportion of code examples that cannot be executed**: The proportion of generated code examples that can be compiled but cannot be executed.

### Case studies:

- **Dataset**: Use the API documentation of Java Runtime Environment 8 (JRE 8), which contains 217 packages and 2,397 classes.
- **Evaluation models**: Select three LLMs (ChatGPT, MagiCoder, and DeepSeek Coder) for evaluation.
- **Research questions**:
  - **Quality evaluation**: Evaluate the quality of API recommendations and code examples generated by LLMs.
  - **Error analysis**: Analyze the types of errors that occur in generated API recommendations and code examples.
  - **Factor analysis**: Explore the factors related to the quality of API-oriented code generation.
  - **Error mitigation**: Study whether retrieval-augmented generation (RAG) can reduce errors and improve code quality.

Through these methods, the paper aims to fill the gaps in existing research, provide a systematic and automated evaluation tool, and help researchers and developers better understand and improve the performance of LLMs in API-oriented code generation.
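The four evaluation metrics described above are simple proportions, and a minimal Python sketch may make them concrete. This is an illustration, not the authors' implementation: the record type `CodeExampleResult`, the function names, and the sample data are all hypothetical. The sketch also assumes that every Task 2 metric is divided by the total number of generated code examples; the paper may use different denominators.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class CodeExampleResult:
    """Hypothetical record: the outcome of checking one generated code example."""
    invokes_target_api: bool  # does the example call the API it was asked to use?
    compiles: bool            # did the example compile?
    executes: bool            # did the compiled example run without error?


def incorrect_recommendation_rate(recommended: Iterable[str],
                                  documented: Iterable[str]) -> float:
    """Task 1: share of recommended APIs that are absent from the documentation."""
    recommended = list(recommended)
    if not recommended:
        return 0.0
    documented_set = set(documented)
    wrong = sum(1 for api in recommended if api not in documented_set)
    return wrong / len(recommended)


def code_example_rates(results: Iterable[CodeExampleResult]) -> Dict[str, float]:
    """Task 2: three proportions over all generated examples (assumed denominator)."""
    results = list(results)
    total = len(results)
    if total == 0:
        return {"no_api_invoked": 0.0, "uncompilable": 0.0, "unexecutable": 0.0}
    no_api = sum(1 for r in results if not r.invokes_target_api)
    uncompilable = sum(1 for r in results if not r.compiles)
    # "Unexecutable" counts examples that compile but then fail at run time.
    unexecutable = sum(1 for r in results if r.compiles and not r.executes)
    return {
        "no_api_invoked": no_api / total,
        "uncompilable": uncompilable / total,
        "unexecutable": unexecutable / total,
    }


# Made-up sample data for illustration only.
documented_apis = {"ArrayList.add", "ArrayList.size", "ArrayList.clear", "ArrayList.get"}
recommended_apis = ["ArrayList.add", "ArrayList.push", "ArrayList.size", "ArrayList.clear"]
sample_results = [
    CodeExampleResult(invokes_target_api=True, compiles=True, executes=True),
    CodeExampleResult(invokes_target_api=False, compiles=True, executes=True),
    CodeExampleResult(invokes_target_api=True, compiles=False, executes=False),
    CodeExampleResult(invokes_target_api=True, compiles=True, executes=False),
]

print(incorrect_recommendation_rate(recommended_apis, documented_apis))  # 0.25
print(code_example_rates(sample_results))
```

In the sample data, `ArrayList.push` is the only recommended name missing from the documented set, so one of four recommendations counts as incorrect. In a real pipeline the three boolean fields would come from static analysis, a compiler run, and an execution run respectively; here they are supplied by hand.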