Magicoder: Empowering Code Generation with OSS-Instruct

Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, Lingming Zhang
2024-06-07
Abstract: We introduce Magicoder, a series of fully open-source (code, weights, and data) Large Language Models (LLMs) for code that significantly closes the gap with top code models while having no more than 7B parameters. Magicoder models are trained on 75K synthetic instruction data using OSS-Instruct, a novel approach to enlightening LLMs with open-source code snippets to generate diverse instruction data for code. Our main motivation is to mitigate the inherent bias of the synthetic data generated by LLMs through the wealth of open-source references for the production of more realistic and controllable data. The orthogonality of OSS-Instruct and other data generation methods like Evol-Instruct further enables us to build an enhanced MagicoderS. Both Magicoder and MagicoderS substantially outperform state-of-the-art code models with similar or even larger sizes on a wide range of coding benchmarks. Notably, MagicoderS-CL-7B based on CodeLlama even surpasses the prominent ChatGPT on HumanEval+ (66.5 vs. 65.9 in pass@1). Overall, OSS-Instruct opens a new direction for crafting diverse synthetic instruction data for code using abundant open-source references.
Computation and Language, Artificial Intelligence, Software Engineering
What problem does this paper attempt to address?
The paper has two main goals:

1. **Propose a new method** (OSS-Instruct) for generating high-quality instruction data for training code LLMs. The method feeds open-source code snippets to an LLM as inspiration, so that the generated programming problems and solutions are more diverse, realistic, and controllable than data produced by an LLM alone.
2. **Develop Magicoder, a series of fully open-source LLMs for code** with no more than 7B parameters that perform strongly across code generation benchmarks. The Magicoder models are trained on 75K instruction examples generated with OSS-Instruct, and the MagicoderS variants additionally use Evol-Instruct data. One of them, MagicoderS-CL-7B, surpasses ChatGPT on HumanEval+ (66.5 vs. 65.9 in pass@1).

In short, the paper improves LLM code generation by introducing a new data generation technique (OSS-Instruct) and by releasing a series of open-source code models (Magicoder) trained with it.
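
The data generation idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the `[Problem]`/`[Solution]` response format, the default snippet length, and every function name here (`extract_seed_snippet`, `build_prompt`, `parse_response`, `generate_example`) are assumptions made for illustration, and the LLM is passed in as a plain callable rather than tied to any real API.

```python
import random
from typing import Callable, Dict

# Hypothetical prompt wording; the paper's actual prompt is not given in this summary.
PROMPT_TEMPLATE = (
    "Gain inspiration from the following code snippet to create a "
    "self-contained programming problem, then solve it.\n\n"
    "Code snippet:\n{snippet}\n\n"
    "Answer using exactly two sections:\n"
    "[Problem]\n...\n[Solution]\n..."
)


def extract_seed_snippet(source: str, max_lines: int, rng: random.Random) -> str:
    """Pick a random run of consecutive lines from an open-source file."""
    lines = source.splitlines()
    if not lines:
        raise ValueError("source is empty")
    length = rng.randint(1, min(max_lines, len(lines)))
    start = rng.randint(0, len(lines) - length)
    return "\n".join(lines[start:start + length])


def build_prompt(snippet: str) -> str:
    """Embed the seed snippet into the instruction-generation prompt."""
    return PROMPT_TEMPLATE.format(snippet=snippet)


def parse_response(text: str) -> Dict[str, str]:
    """Split an LLM response into its problem and solution parts."""
    problem_part, sep, solution_part = text.partition("[Solution]")
    if not sep or "[Problem]" not in problem_part:
        raise ValueError("response is missing a [Problem] or [Solution] section")
    problem = problem_part.split("[Problem]", 1)[1].strip()
    return {"problem": problem, "solution": solution_part.strip()}


def generate_example(
    source: str,
    llm: Callable[[str], str],
    rng: random.Random,
    max_lines: int = 15,  # arbitrary illustrative default
) -> Dict[str, str]:
    """Produce one instruction-tuning example seeded by real code."""
    snippet = extract_seed_snippet(source, max_lines, rng)
    example = parse_response(llm(build_prompt(snippet)))
    example["seed"] = snippet  # keep the seed for later inspection or filtering
    return example
```

Calling `generate_example` once per seed file yields problem/solution pairs that can be collected into an instruction-tuning dataset. Because each prompt is anchored to a different real snippet, the variety of the generated problems follows the variety of the open-source code rather than only the LLM's own preferences.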