AutoCoder: Enhancing Code Large Language Model with \textsc{AIEV-Instruct}

Bin Lei, Yuchen Li, Qiuwu Chen
2024-05-23
Abstract: We introduce AutoCoder, the first Large Language Model to surpass GPT-4 Turbo (April 2024) and GPT-4o in pass@1 on the HumanEval benchmark ($\mathbf{90.9\%}$ vs. $\mathbf{90.2\%}$). In addition, AutoCoder offers a more versatile code interpreter than GPT-4 Turbo and GPT-4o: its code interpreter can install external packages instead of being limited to built-in packages. AutoCoder's training data is a multi-turn dialogue dataset created by a system combining agent interaction and external code execution verification, a method we term \textbf{\textsc{AIEV-Instruct}} (Instruction Tuning with Agent-Interaction and Execution-Verified). Compared to previous large-scale code dataset generation methods, \textsc{AIEV-Instruct} reduces dependence on proprietary large models and provides an execution-validated code dataset. The code and the demo video are available at \url{
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
This paper presents AutoCoder, a code large language model designed to surpass GPT-4 Turbo and GPT-4o on code generation tasks. One distinguishing feature is its code interpreter, which can install external packages rather than being limited to built-in ones; according to the paper, other models such as GPT-4o can only handle code that uses built-in packages.

The paper also introduces AIEV-INSTRUCT, a new method for building large-scale code instruction datasets. It simulates programmers writing code and running unit tests against project requirements, so that the resulting data is verified by execution. AIEV-INSTRUCT consists of a teaching phase and a self-learning phase, which reduces reliance on expensive closed-source models. In the teaching phase, a powerful teacher model generates synthetic code instruction data that is used to fine-tune a smaller student model. In the self-learning phase, the student model itself takes over the role of code annotator.

Using AIEV-INSTRUCT, the authors built a code instruction dataset of 169K samples and trained AutoCoder (33B) and AutoCoder-S (6.7B). AutoCoder achieves a Pass@1 score of 90.9% on the HumanEval benchmark, ahead of GPT-4 Turbo and GPT-4o (90.2%). The paper further compares AutoCoder with other large language models on additional datasets and reports strong results across multiple code-related tasks.
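
The execution-verification idea behind AIEV-INSTRUCT can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `run_with_tests` and `verify_sample`, the `generate` callback (standing in for whichever teacher or student model proposes code and unit tests), and the retry limit are all assumptions made for illustration. The sketch runs each candidate together with its tests in a separate interpreter process, feeds any error output back to the generator, and keeps only samples whose tests pass.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Dict, Optional, Tuple

# Hypothetical signature: (task, feedback) -> (code, unit_tests).
# `feedback` is None on the first attempt, then the previous error output.
Generator = Callable[[str, Optional[str]], Tuple[str, str]]


def run_with_tests(code: str, tests: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run candidate code followed by its unit tests in a fresh interpreter.

    Returns (passed, stderr). A non-zero exit code or a timeout counts as failure.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
        return proc.returncode == 0, proc.stderr


def verify_sample(
    generate: Generator, task: str, max_rounds: int = 3
) -> Optional[Dict[str, str]]:
    """Ask the generator for code and tests, retrying with execution feedback.

    Returns the verified sample, or None if no attempt passed its tests.
    """
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        code, tests = generate(task, feedback)
        passed, stderr = run_with_tests(code, tests)
        if passed:
            return {"task": task, "code": code, "tests": tests}
        feedback = stderr  # the error output becomes the next round's hint
    return None
```

In this sketch, a sample enters the dataset only when `verify_sample` returns a dictionary; a `None` result means the task is dropped. A production system would presumably run candidates in an isolated sandbox rather than a plain subprocess, since generated code is untrusted.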