Abstract: Instruction tuning is a supervised fine-tuning approach that significantly improves the ability of large language models (LLMs) to follow human instructions. We propose SelfCodeAlign, the first fully transparent and permissive pipeline for self-aligning code LLMs without extensive human annotations or distillation. SelfCodeAlign employs the same base model for inference throughout the data generation process. It first extracts diverse coding concepts from high-quality seed snippets to generate new tasks. It then samples multiple responses per task, pairs each with test cases, and validates them in a sandbox environment. Finally, passing examples are selected for instruction tuning. In our primary experiments, we use SelfCodeAlign with CodeQwen1.5-7B to generate a dataset of 74k instruction-response pairs. Finetuning on this dataset leads to a model that achieves a 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct despite being ten times smaller. Across all benchmarks, this finetuned model consistently outperforms the original version trained with OctoPack, the previous state-of-the-art method for instruction tuning without human annotations or distillation. Additionally, we show that SelfCodeAlign is effective across LLMs of various sizes, from 3B to 33B, and that the base models can benefit more from alignment with their own data distribution. We further validate each component's effectiveness in our pipeline, showing that SelfCodeAlign outperforms both direct distillation from GPT-4o and leading GPT-3.5-based distillation methods, such as OSS-Instruct and Evol-Instruct. SelfCodeAlign has also led to the creation of StarCoder2-Instruct, the first fully transparent, permissively licensed, and self-aligned code LLM that achieves state-of-the-art coding performance.

What problem does this paper attempt to address?

The problem this paper attempts to address is: how to achieve self-alignment of large language models (LLMs) for code generation without the need for extensive human-labeled data or knowledge distillation. Specifically, the authors propose a method called **SelfCodeAlign**, which enhances the capabilities of code generation models by automatically generating high-quality instruction-response pairs, thereby enabling the models to better follow natural language instructions.

### Background and Challenges

1. **Importance of Instruction Tuning**:
   - Large language models (LLMs) perform well on code-related tasks, but to fully realize their potential, instruction tuning is often required, which involves further fine-tuning the model with high-quality instruction-response pairs.
   - Instruction tuning data typically comes from human annotation or is generated from stronger models through knowledge distillation, but these methods have high costs and licensing restrictions.
2. **Limitations of Existing Methods**:
   - Human-labeled data is expensive and time-consuming.
   - Knowledge distillation may violate terms of service and relies on stronger models, limiting its generalizability.
   - Some existing open-source code generation models either use proprietary data, do not disclose their instruction tuning strategies, or rely on knowledge distillation.

### Solution

**SelfCodeAlign** is a fully transparent and permissively licensed self-alignment pipeline that generates high-quality instruction-response pairs without relying on extensive human-labeled data or knowledge distillation. The specific steps are as follows:

1. **Seed Code Snippet Collection**:
   - Extract high-quality seed code snippets from a large permissively licensed codebase (e.g., The Stack V1).
   - Ensure the quality and diversity of seed snippets through a series of filtering rules.
2. **Diverse Instruction Generation**:
   - Use a base model to generate diverse instructions from seed code snippets through in-context learning.
   - Extract code concepts and generate new coding tasks based on difficulty and category.
3. **Response Generation and Self-Verification**:
   - For each generated instruction, the base model generates multiple responses and creates test cases for each response.
   - Execute tests in a sandbox environment and select the responses that pass the tests as the final instruction-response pairs.

### Experimental Results

- **Benchmarking**:
  - Evaluated on multiple code generation tasks, including function generation, class generation, data science programming, and code editing.
  - Models trained with SelfCodeAlign achieved significant performance improvements on benchmarks like HumanEval+ and MBPP+, notably achieving a 67.1% pass@1 score on HumanEval+, surpassing many other models, including larger ones like CodeLlama-70B-Instruct.
- **Component Analysis**:
  - Detailed experiments validated the effectiveness of each component of SelfCodeAlign, including seed selection, concept generation, and execution filtering.

### Main Contributions

1. **Proposed a fully transparent and permissively licensed self-alignment pipeline** that enhances the performance of code generation models without relying on extensive human-labeled data or knowledge distillation.
2. **Generated a series of datasets** and trained multiple models on these datasets, all of which will be publicly released.
3. **Conducted comprehensive evaluations on various tasks**, demonstrating the effectiveness of SelfCodeAlign.
4. **Proved that training models on their own data distribution can be more effective than using a stronger but differently distributed teacher model**.
5. **Validated the positive contributions of each part of SelfCodeAlign through component analysis**.

In summary, SelfCodeAlign provides an efficient and transparent method for self-alignment of code generation models, significantly enhancing their performance.
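
The response generation and self-verification step summarized above (sample several responses per instruction, pair each with test cases, execute them, keep what passes) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `tasks` data layout, and the choice to keep only the first passing response per instruction are all assumptions made for illustration, and a child process with a timeout merely stands in for the sandbox environment the paper describes.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(response_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Run a candidate response together with its tests in a child process.

    NOTE: a child process with a timeout is only a stand-in for a real
    sandbox; it does not isolate the file system or the network.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(response_code + "\n\n" + test_code + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Treat non-terminating candidates as failures.
            return False
        # A failed assertion or any uncaught error yields a non-zero exit code.
        return result.returncode == 0


def select_passing_pairs(tasks):
    """Keep at most one passing instruction-response pair per task.

    `tasks` is a hypothetical structure: a list of dicts, each with an
    "instruction" string and a "candidates" list of (response, tests) tuples.
    """
    selected = []
    for task in tasks:
        for response, tests in task["candidates"]:
            if passes_tests(response, tests):
                selected.append(
                    {"instruction": task["instruction"], "response": response}
                )
                break  # assumption: keep the first passing response only
    return selected
```

Given a task whose first candidate fails its assertions and whose second candidate passes, `select_passing_pairs` would keep only the second; a task with no passing candidate contributes nothing to the resulting dataset. In a real pipeline the child process would be replaced by a properly isolated execution environment.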

SelfCodeAlign: Self-Alignment for Code Generation

InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct

InstructCoder: Instruction Tuning Large Language Models for Code Editing

DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning

CodecLM: Aligning Language Models with Tailored Synthetic Data

Evaluating and Aligning CodeLLMs on Human Preference

ACECode: A Reinforcement Learning Framework for Aligning Code Efficiency and Correctness in Code Language Models

Semi-Instruct: Bridging Natural-Instruct and Self-Instruct for Code Large Language Models

AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data

OctoPack: Instruction Tuning Code Large Language Models

CodeT5+: Open Code Large Language Models for Code Understanding and Generation

CITING: Large Language Models Create Curriculum for Instruction Tuning

Self-alignment with instruction backtranslation

Self-Judge: Selective Instruction Following with Alignment Self-Evaluation

CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences

CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning

Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning CodeLLMs