ToolCoder: Teach Code Generation Models to use API search tools

Kechi Zhang, Huangzhao Zhang, Ge Li, Jia Li, Zhuo Li, Zhi Jin
2023-09-11
Abstract: Automatically generating source code from natural language descriptions has been a growing field of research in recent years. However, current large-scale code generation models often encounter difficulties when selecting appropriate APIs for specific contexts. These models may generate APIs that do not meet requirements or refer to non-existent APIs in third-party libraries, especially for lesser-known or private libraries. Inspired by the process of human developers using tools to search APIs, we propose ToolCoder, a novel approach that integrates API search tools with existing models to assist in code generation and API selection. To teach our model to use tools, we introduce an automated data annotation method using ChatGPT to add tool usage information into the source code data and fine-tune code generation models. During inference, we integrate API search tools into the generation process so that our model can automatically use the search tool to get suggestions when selecting an API. Our experimental results demonstrate that ToolCoder exhibits excellent performance and generalization across five public and private library code generation benchmarks, with at least 6.21% improvement on average pass@1 metrics and 9.64% improvement on average pass@10 metrics compared to state-of-the-art methods. Furthermore, we show that our relatively small ToolCoder model is comparable to one of the current best models, GPT-3.5, highlighting the potential of incorporating programming tools into the code generation process.
Software Engineering
What problem does this paper attempt to address?
The paper addresses a persistent weakness of large code generation models: selecting appropriate APIs (Application Programming Interfaces) for a specific context. Current models often generate APIs that do not meet the requirements, or reference APIs that do not exist in third-party libraries, especially for lesser-known or private libraries.

To address this, the paper proposes ToolCoder, an approach that integrates an API search tool into existing code generation models to assist with code generation and API selection. ToolCoder has three components:

1. **Automatic Data Annotation**: The researchers developed an automated annotation method that uses ChatGPT to add tool usage information to source code data. The annotated data is then used to teach the model when and how to call the search tool.
2. **Parameter-Efficient Fine-Tuning**: The researchers adopted a parameter-efficient fine-tuning method in which only a small number of parameters are adjusted. This makes training feasible even on consumer-grade GPUs.
3. **Inference Enhancement**: During inference, the API search tool is integrated into the generation process, so the model can automatically query the tool for suggestions when it needs to select an API.

In the experimental evaluation, ToolCoder was tested on five code generation benchmarks covering public and private libraries. It improved on state-of-the-art methods by at least 6.21% on average pass@1 and 9.64% on average pass@10, and it generalized across the different types of libraries.
In summary, the paper proposes a new solution, ToolCoder, aimed at improving the performance of code generation models in API selection by integrating an API search tool, thereby enhancing the quality and accuracy of the generated code.
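The inference step described above, where generation pauses to consult a search tool and then continues with its suggestions, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `<search>...</search>` marker, the keyword-based `search_api` function, the toy `API_INDEX`, and the scripted stand-in for the language model are hypothetical and are not taken from the paper.

```python
import re
from typing import Callable, Dict, List

# Hypothetical marker the model emits when it wants API suggestions.
CALL_PATTERN = re.compile(r"<search>(.*?)</search>$", re.DOTALL)

# Toy index (assumption): API name -> keywords describing what it does.
API_INDEX: Dict[str, List[str]] = {
    "DataFrame.drop_duplicates": ["remove", "duplicate"],
    "DataFrame.dropna": ["remove", "missing"],
}


def search_api(query: str, index: Dict[str, List[str]]) -> List[str]:
    """Return the APIs whose keywords all appear in the query."""
    words = set(query.lower().split())
    return [api for api, keywords in index.items() if set(keywords) <= words]


def generate_with_tool(
    step: Callable[[str], str],
    prompt: str,
    index: Dict[str, List[str]],
    max_steps: int = 8,
) -> str:
    """Run a generation loop that resolves tool calls as they appear."""
    text = prompt
    for _ in range(max_steps):
        chunk = step(text)
        if chunk == "":  # the model signals it is done
            break
        text += chunk
        match = CALL_PATTERN.search(text)
        if match:
            hits = search_api(match.group(1), index)
            # Swap the tool call for its results so later steps can use them.
            text = text[: match.start()] + "# suggestions: " + ", ".join(hits) + "\n"
    return text


def scripted_model(text: str) -> str:
    """Stand-in for a fine-tuned model: asks the tool first, then writes code."""
    if "suggestions:" not in text:
        return "<search>remove duplicate rows</search>"
    if "drop_duplicates" in text and not text.endswith(")\n"):
        return "df = df.drop_duplicates()\n"
    return ""


result = generate_with_tool(scripted_model, "# Remove duplicate rows from df\n", API_INDEX)
print(result)
```

In a real system the scripted function would be replaced by a fine-tuned model and the keyword lookup by an actual search backend. The loop structure (generate, detect a tool call, splice in the results, continue) is the part this sketch is meant to convey.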