SelfCodeAlign: Self-Alignment for Code Generation

Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro von Werra, Arjun Guha, Lingming Zhang
2024-11-02
Abstract: Instruction tuning is a supervised fine-tuning approach that significantly improves the ability of large language models (LLMs) to follow human instructions. We propose SelfCodeAlign, the first fully transparent and permissive pipeline for self-aligning code LLMs without extensive human annotations or distillation. SelfCodeAlign employs the same base model for inference throughout the data generation process. It first extracts diverse coding concepts from high-quality seed snippets to generate new tasks. It then samples multiple responses per task, pairs each with test cases, and validates them in a sandbox environment. Finally, passing examples are selected for instruction tuning. In our primary experiments, we use SelfCodeAlign with CodeQwen1.5-7B to generate a dataset of 74k instruction-response pairs. Finetuning on this dataset leads to a model that achieves a 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct despite being ten times smaller. Across all benchmarks, this finetuned model consistently outperforms the original version trained with OctoPack, the previous state-of-the-art method for instruction tuning without human annotations or distillation. Additionally, we show that SelfCodeAlign is effective across LLMs of various sizes, from 3B to 33B, and that the base models can benefit more from alignment with their own data distribution. We further validate each component's effectiveness in our pipeline, showing that SelfCodeAlign outperforms both direct distillation from GPT-4o and leading GPT-3.5-based distillation methods, such as OSS-Instruct and Evol-Instruct. SelfCodeAlign has also led to the creation of StarCoder2-Instruct, the first fully transparent, permissively licensed, and self-aligned code LLM that achieves state-of-the-art coding performance.
Computation and Language, Machine Learning, Software Engineering
What problem does this paper attempt to address?
This paper addresses how to self-align large language models (LLMs) for code generation without extensive human-labeled data or knowledge distillation. Specifically, the authors propose **SelfCodeAlign**, a method that enhances code generation models by automatically generating high-quality instruction-response pairs, enabling the models to better follow natural language instructions.

### Background and Challenges

1. **Importance of Instruction Tuning**:
   - Large language models (LLMs) perform well on code-related tasks, but realizing their full potential often requires instruction tuning, that is, further fine-tuning the model on high-quality instruction-response pairs.
   - Instruction tuning data typically comes from human annotation or is generated by stronger models through knowledge distillation, but these sources carry high costs and licensing restrictions.
2. **Limitations of Existing Methods**:
   - Human-labeled data is expensive and time-consuming to produce.
   - Knowledge distillation may violate terms of service and relies on stronger models, limiting its generalizability.
   - Some existing open-source code generation models either use proprietary data, do not disclose their instruction tuning strategies, or rely on knowledge distillation.

### Solution

**SelfCodeAlign** is a fully transparent and permissively licensed self-alignment pipeline that generates high-quality instruction-response pairs without relying on extensive human-labeled data or knowledge distillation. The specific steps are as follows:

1. **Seed Code Snippet Collection**:
   - Extract high-quality seed code snippets from a large permissively licensed codebase (e.g., The Stack V1).
   - Ensure the quality and diversity of seed snippets through a series of filtering rules.
2. **Diverse Instruction Generation**:
   - Use a base model to generate diverse instructions from seed code snippets through in-context learning.
   - Extract code concepts and generate new coding tasks based on difficulty and category.
3. **Response Generation and Self-Verification**:
   - For each generated instruction, the base model generates multiple responses and creates test cases for each response.
   - Execute the tests in a sandbox environment and select the responses that pass as the final instruction-response pairs.

### Experimental Results

- **Benchmarking**:
  - Evaluated on multiple code generation tasks, including function generation, class generation, data science programming, and code editing.
  - Models trained with SelfCodeAlign achieved significant performance improvements on benchmarks such as HumanEval+ and MBPP+, notably reaching a 67.1% pass@1 score on HumanEval+ and surpassing many other models, including larger ones such as CodeLlama-70B-Instruct.
- **Component Analysis**:
  - Detailed experiments validated the effectiveness of each component of SelfCodeAlign, including seed selection, concept generation, and execution filtering.

### Main Contributions

1. **Proposed a fully transparent and permissively licensed self-alignment pipeline** that enhances the performance of code generation models without relying on extensive human-labeled data or knowledge distillation.
2. **Generated a series of datasets** and trained multiple models on these datasets, all of which will be publicly released.
3. **Conducted comprehensive evaluations on various tasks**, demonstrating the effectiveness of SelfCodeAlign.
4. **Showed that training models on their own data distribution can be more effective than using a stronger but differently distributed teacher model**.
5. **Validated the positive contributions of each part of SelfCodeAlign through component analysis**.

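The response generation and self-verification step described under Solution can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `passes_tests` and `select_passing` are hypothetical, the candidate responses are hard-coded instead of being sampled from a base model, and a plain subprocess with a timeout stands in for the sandbox (it provides no real isolation).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(response_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Run a candidate response together with its tests in a child process.

    Assumption: a subprocess with a timeout is only a stand-in for a sandbox.
    It does not isolate the file system or network.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(response_code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Treat non-terminating candidates as failures.
            return False
        # A failed assertion or any uncaught exception gives a non-zero exit code.
        return result.returncode == 0


def select_passing(instruction: str, candidates: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep (instruction, response) pairs whose accompanying tests pass."""
    return [
        (instruction, response)
        for response, tests in candidates
        if passes_tests(response, tests)
    ]


# Hypothetical example: two sampled responses, each paired with its own test.
instruction = "Write a function add(a, b) that returns the sum of two numbers."
candidates = [
    ("def add(a, b):\n    return a + b", "assert add(2, 3) == 5"),
    ("def add(a, b):\n    return a - b", "assert add(2, 3) == 5"),
]
kept = select_passing(instruction, candidates)
print(len(kept))
```

Only the first candidate survives the filter, so it alone would be kept as an instruction-response pair for fine-tuning. A production setup would replace the subprocess with a properly isolated execution environment.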
In summary, SelfCodeAlign provides an efficient and transparent method for self-alignment of code generation models, significantly enhancing their performance.
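For readers unfamiliar with the pass@1 figures quoted above, the following sketch shows one common way such a score is computed. It assumes the standard unbiased pass@k estimator that was introduced alongside the original HumanEval benchmark; the summary itself does not state how the reported numbers were obtained, and the function names here are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: sample budget being evaluated (1 for pass@1)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1. comb() returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(per_problem: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass, so a benchmark-level pass@1 is the mean of that fraction across problems.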