AutoCoder: Enhancing Code Large Language Model with \textsc{AIEV-Instruct}

Bin Lei,Yuchen Li,Qiuwu Chen

2024-05-23

Abstract:We introduce AutoCoder, the first Large Language Model to surpass GPT-4 Turbo (April 2024) and GPT-4o in pass@1 on the Human Eval benchmark test ($\mathbf{90.9\%}$ vs. $\mathbf{90.2\%}$). In addition, AutoCoder offers a more versatile code interpreter compared to GPT-4 Turbo and GPT-4o. It's code interpreter can install external packages instead of limiting to built-in packages. AutoCoder's training data is a multi-turn dialogue dataset created by a system combining agent interaction and external code execution verification, a method we term \textbf{\textsc{AIEV-Instruct}} (Instruction Tuning with Agent-Interaction and Execution-Verified). Compared to previous large-scale code dataset generation methods, \textsc{AIEV-Instruct} reduces dependence on proprietary large models and provides execution-validated code dataset. The code and the demo video is available in \url{

Software Engineering,Artificial Intelligence

What problem does this paper attempt to address?

This paper presents AutoCoder, a large-scale language model designed to surpass the performance of GPT-4 Turbo and GPT-4o in code generation tasks. What sets AutoCoder apart is its code interpreter that can install external packages, not just limited to built-in packages. The paper introduces a new method for annotating large-scale code instruction datasets called AIEV-INSTRUCT, which simulates the process of programmers building code and conducting unit tests based on project requirements to ensure dataset accuracy. AIEV-INSTRUCT consists of a teaching phase and a self-learning phase, reducing reliance on expensive closed-source models. In the teaching phase, a powerful teacher model is used to generate synthetic encoding instructions to fine-tune a smaller student model. In the self-learning phase, the model acts as a proxy for code annotation itself. AutoCoder achieves a Pass@1 score of 90.9% on the Human Eval benchmark, surpassing other top models. Furthermore, AutoCoder's code interpreter is more powerful and can handle commands involving external package installations, while other models like GPT-4o can only handle code with built-in packages. The paper also compares the performance of AutoCoder with other large-scale language models on different datasets, demonstrating its superiority in code generation tasks. Through AIEV-INSTRUCT, researchers created a high-quality code instruction dataset with 169K samples and trained models including AutoCoder (33B) and AutoCoder-S (6.7B). Experimental results show that AutoCoder performs exceptionally well on multiple code-related tasks.

AutoCoder: Enhancing Code Large Language Model with \textsc{AIEV-Instruct}

Large Language Models as Code Executors: An Exploratory Study

OpenAi's GPT4 as coding assistant

InstructCoder: Instruction Tuning Large Language Models for Code Editing

AICoderEval: Improving AI Domain Code Generation of Large Language Models

JumpCoder: Go Beyond Autoregressive Coder via Online Modification

MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning

SelfEvolve: A Code Evolution Framework via Large Language Models

BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

AI-assisted coding: Experiments with GPT-4

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

UniCoder: Scaling Code Large Language Model via Universal Code

GPTutor: an open-source AI pair programming tool alternative to Copilot

Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions

Code Generation Using Self-Interactive Assistant

A Systematic Evaluation of Large Language Models of Code

Deep-AutoCoder: Learning to Complete Code Precisely with Induced Code Tokens

CodeT5+: Open Code Large Language Models for Code Understanding and Generation