SEED: Customize Large Language Models with Sample-Efficient Adaptation for Code Generation

Xue Jiang, Yihong Dong, Zhi Jin, Ge Li
2024-03-24
Abstract: Although Large Language Models (LLMs) have made significant progress in code generation, they still struggle with code generation tasks in specific scenarios. These scenarios usually necessitate the adaptation of LLMs to fulfill specific needs, but the limited training samples available in practice lead to poor code generation performance. Therefore, how to effectively adapt LLMs to new scenarios with few training samples is a major challenge for current code generation. In this paper, we propose a novel adaptation approach named SEED, which stands for Sample-Efficient adaptation with Error-Driven learning for code generation. SEED leverages the errors made by LLMs as learning opportunities, using error revision to overcome its own shortcomings, thus achieving efficient learning. Specifically, SEED involves identifying error code generated by LLMs, employing Self-revise for code revision, optimizing the model with revised code, and iteratively adapting the process for continuous improvement. Experimental results show that, compared to other mainstream fine-tuning approaches, SEED achieves superior performance with few training samples, showing an average relative improvement of 54.7% in Pass@1 on multiple code generation benchmarks. We also validate the effectiveness of Self-revise, which generates revised code that optimizes the model more efficiently compared to the code samples from datasets. Moreover, SEED consistently demonstrates strong performance across various LLMs, underscoring its generalizability.
Software Engineering, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses the difficulty large language models (LLMs) have with code generation in specific scenarios, especially when few training samples are available. Although LLMs have made remarkable progress in code generation, their performance in such scenarios is still unsatisfactory. These scenarios usually require adapting the LLM to specific needs, but the limited training samples available in practice lead to poor code generation performance. How to adapt LLMs effectively with only a small number of training samples has therefore become a major challenge in code generation.

To address this challenge, the authors propose a new approach named SEED: Sample-Efficient adaptation with Error-Driven learning for code generation. The core idea of SEED is to treat the erroneous code generated by an LLM as a learning opportunity and to overcome the model's own shortcomings through error revision, so that adaptation becomes sample-efficient. Specifically, SEED consists of four steps:

1. **Error code collection**: Identify and collect the erroneous code generated by the LLM, in order to expose the model's weaknesses.
2. **Automatic code revision**: Use a method called Self-revise to revise the erroneous code automatically and at low cost, drawing on the information in the original dataset and on code execution feedback.
3. **Model optimization**: Optimize the LLM with the revised code, so that the model focuses on learning the revisions of these key errors and learns more efficiently.
4. **Iterative adaptation**: Repeat the three steps above iteratively to keep improving the performance of the LLM.
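The four steps above can be sketched as a simple loop. This is only an illustration under stated assumptions, not the authors' implementation: `Sample`, `run_tests`, `seed_adapt`, `base_model`, `toy_self_revise` and `toy_fine_tune` are all hypothetical names. The "model" is a plain function, the reviser is a hard-coded string patch rather than an LLM, and "fine-tuning" merely memorizes the revised code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # hypothetical: maps a requirement to generated code


@dataclass
class Sample:
    requirement: str
    reference_code: str                # solution from the original dataset
    cases: List[Tuple[tuple, object]]  # (arguments, expected result) pairs


def run_tests(code: str, cases) -> Tuple[bool, str]:
    """Execute `code`, call its `solve` function, return (passed, feedback)."""
    env: dict = {}
    try:
        exec(code, env)
        for args, expected in cases:
            got = env["solve"](*args)
            if got != expected:
                return False, f"solve{args} returned {got!r}, expected {expected!r}"
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"


def seed_adapt(model: Model, train_set, self_revise, fine_tune, iterations=2) -> Model:
    for _ in range(iterations):                        # step 4: iterative adaptation
        revised_pairs = []
        for sample in train_set:
            code = model(sample.requirement)
            passed, feedback = run_tests(code, sample.cases)
            if passed:
                continue                               # nothing to learn from here
            # step 1: `code` is an erroneous generation; step 2: revise it
            revised = self_revise(sample, code, feedback)
            if run_tests(revised, sample.cases)[0]:    # keep only working revisions
                revised_pairs.append((sample.requirement, revised))
        if not revised_pairs:
            break                                      # no errors left to learn from
        model = fine_tune(model, revised_pairs)        # step 3: model optimization
    return model


# --- toy stand-ins (assumptions, for demonstration only) ---
def base_model(requirement: str) -> str:
    return "def solve(a, b):\n    return a - b\n"      # deliberately buggy output


def toy_self_revise(sample: Sample, wrong_code: str, feedback: str) -> str:
    # A real reviser would prompt an LLM with the requirement, the wrong code,
    # the reference solution and the execution feedback.
    return wrong_code.replace("a - b", "a + b")


def toy_fine_tune(model: Model, pairs) -> Model:
    memory = dict(pairs)                               # "training" = memorizing
    return lambda req: memory.get(req, model(req))


train_set = [Sample("add two numbers",
                    "def solve(a, b):\n    return a + b\n",
                    [((1, 2), 3), ((0, 0), 0)])]
adapted = seed_adapt(base_model, train_set, toy_self_revise, toy_fine_tune)
print(run_tests(adapted("add two numbers"), train_set[0].cases))
```

In this sketch the revised code is a minimal edit of the model's own output rather than a copy of the dataset solution, and only revisions that pass the tests are used for training.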
Experimental results show that, compared with mainstream fine-tuning approaches, SEED achieves superior performance on multiple code generation benchmarks when few training samples are available, with an average relative improvement of 54.7% in Pass@1. SEED also performs consistently well across different LLMs, which underscores its generalizability.
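To make the reported metric concrete, the sketch below shows how a Pass@1 score and a relative improvement are typically computed. It assumes the widely used unbiased pass@k estimator, which this summary does not confirm the paper uses. The function names are hypothetical and the numbers are made up for illustration; they are not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def relative_improvement(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Hypothetical scores, chosen only to show the arithmetic.
baseline_pass1 = pass_at_k(n=10, c=2, k=1)  # 2 of 10 samples pass
adapted_pass1 = pass_at_k(n=10, c=3, k=1)   # 3 of 10 samples pass
print(round(relative_improvement(baseline_pass1, adapted_pass1), 1))
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n. A relative improvement is measured against the baseline score, so it is not the same as an absolute gain in percentage points.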