Abstract: Code generation models have shown significant potential for programming tasks. However, existing training methods like supervised fine-tuning face key limitations: they do not effectively teach models to prioritize correct over incorrect solutions in ambiguous situations, nor do they effectively optimize the runtime efficiency of the generated code. To address these challenges, we propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency. CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases. The underlying assumption is that test cases executable by multiple code snippets provide more reliable validation, and code that passes more tests is more likely to be correct. Through this self-validation process, our PageRank-inspired algorithm iteratively updates the ranking score of each code snippet, ultimately creating a code preference optimization dataset based on correctness and efficiency. CodeDPO is flexible and scalable, generating diverse preference optimization data without depending on external resources. Through comprehensive evaluations of five widely used benchmarks, CodeDPO demonstrates significant improvements in correctness and efficiency compared to existing methods. Our experiments prove that CodeDPO enhances the capabilities of LLMs in code generation and provides a robust foundation for conducting code preference optimization in more complex and challenging real-world scenarios.
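
The mutual-reinforcement idea in the abstract (tests passed by many snippets are more trustworthy, and snippets that pass trustworthy tests are more likely correct) can be illustrated with a minimal sketch. The abstract does not give the update rule, so the damping factor, the normalisation step, the iteration count, and the function name `rank_code_and_tests` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple


def rank_code_and_tests(
    passes: List[List[bool]],
    damping: float = 0.85,   # assumed value, borrowed from classic PageRank
    iterations: int = 10,    # assumed; the source does not state a count
) -> Tuple[List[float], List[float]]:
    """Score snippets and tests from a pass matrix.

    passes[i][j] is True when code snippet i passes test case j.
    Returns (code_scores, test_scores), each normalised to sum to 1.
    """
    n_code = len(passes)
    n_test = len(passes[0]) if passes else 0
    code_scores = [1.0] * n_code
    test_scores = [1.0] * n_test

    for _ in range(iterations):
        # A snippet earns credit from each test it passes,
        # weighted by how trustworthy that test currently looks.
        new_code = [
            (1 - damping) * code_scores[i]
            + damping * sum(test_scores[j] for j in range(n_test) if passes[i][j])
            for i in range(n_code)
        ]
        # A test earns credit from each snippet that passes it,
        # weighted by how likely that snippet is to be correct.
        new_test = [
            (1 - damping) * test_scores[j]
            + damping * sum(code_scores[i] for i in range(n_code) if passes[i][j])
            for j in range(n_test)
        ]
        # Normalise so the scores stay bounded across iterations.
        code_total = sum(new_code) or 1.0
        test_total = sum(new_test) or 1.0
        code_scores = [s / code_total for s in new_code]
        test_scores = [s / test_total for s in new_test]

    return code_scores, test_scores


# Snippet 0 passes every test, snippet 1 passes two, snippet 2 passes one.
example_passes = [
    [True, True, True],
    [True, True, False],
    [False, False, True],
]
example_code_scores, example_test_scores = rank_code_and_tests(example_passes)
```

In this toy matrix, snippet 0 ends up ranked first, and the test passed only by snippets 0 and 2 ends up with a lower score than the tests that the two stronger snippets agree on.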

What problem does this paper attempt to address?

The paper targets two weaknesses of current training methods for code generation models: **code correctness** and **runtime efficiency**.

1. **Code correctness**:
   - Supervised fine-tuning (SFT) can improve the overall quality of generated code, but it does not effectively teach the model to prefer a correct solution over an incorrect one in ambiguous situations.
   - As a result, the model may still produce undesirable output, especially on complex tasks.
2. **Runtime efficiency**:
   - Existing methods do not effectively optimize the runtime efficiency of generated code, so the generated code may perform worse than hand-written code.

To address these problems, the authors propose the **CodeDPO** framework, which brings preference learning into code generation to improve both correctness and efficiency. Its main components are:

- **Self-generation-and-validation mechanism**: CodeDPO builds its dataset by generating code and test cases together and validating each against the other. Code that passes more tests is treated as more likely to be correct.
- **PageRank-inspired algorithm**: The algorithm iteratively updates the score of each code snippet based on the tests it passes and how reliable those tests are, and the resulting scores are used to build a preference optimization dataset covering correctness and efficiency.
- **Flexibility and scalability**: CodeDPO does not rely on external resources. The self-generation-and-validation mechanism can produce diverse preference optimization data, which the authors present as a foundation for more complex real-world scenarios.

According to the paper, CodeDPO delivers significant improvements in both code correctness and efficiency on five widely used benchmarks.
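
How ranking scores and runtimes might be turned into preference pairs can be sketched as follows. The summary does not say how pairs are selected, so everything here is a hypothetical illustration: the `Candidate` type, the `score_margin` and `runtime_ratio` thresholds, and the rule that efficiency pairs are drawn only from candidates scored close to the best one.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass
class Candidate:
    code: str
    score: float      # ranking score from self-validation (higher = more likely correct)
    runtime_s: float  # measured runtime in seconds


def build_preference_pairs(
    candidates: List[Candidate],
    score_margin: float = 0.1,   # hypothetical: minimum score gap for a correctness pair
    runtime_ratio: float = 1.5,  # hypothetical: minimum slowdown for an efficiency pair
) -> List[Tuple[str, str, str]]:
    """Return (kind, chosen, rejected) triples for preference optimization."""
    if not candidates:
        return []
    best = max(c.score for c in candidates)
    pairs: List[Tuple[str, str, str]] = []
    for a, b in combinations(candidates, 2):
        hi, lo = (a, b) if a.score >= b.score else (b, a)
        if hi.score - lo.score >= score_margin:
            # Clear score gap: prefer the higher-ranked snippet.
            pairs.append(("correctness", hi.code, lo.code))
        elif best - lo.score < score_margin:
            # Both snippets score near the top, so treat both as likely
            # correct and compare them on runtime instead.
            fast, slow = (a, b) if a.runtime_s <= b.runtime_s else (b, a)
            if slow.runtime_s >= runtime_ratio * fast.runtime_s:
                pairs.append(("efficiency", fast.code, slow.code))
    return pairs


example_pairs = build_preference_pairs([
    Candidate("fast_ok", score=0.90, runtime_s=0.1),
    Candidate("slow_ok", score=0.88, runtime_s=0.5),
    Candidate("wrong", score=0.20, runtime_s=0.05),
])
```

With these inputs, the two high-scoring snippets form an efficiency pair, and each of them forms a correctness pair against the low-scoring snippet. Restricting efficiency pairs to near-top candidates is a design choice of this sketch: it avoids rewarding a fast but probably incorrect solution.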

### Summary

The paper addresses the shortcomings of existing code generation models in correctness and efficiency by proposing the CodeDPO framework, with the goal of improving both the quality and the runtime performance of generated code.

CodeDPO: Aligning Code Models with Self Generated and Verified Source Code

Code-Optimise: Self-Generated Preference Data for Correctness and Efficiency

Aligning CodeLLMs with Direct Preference Optimization

Effi-Code: Unleashing Code Efficiency in Language Models

DSTC: Direct Preference Learning with Only Self-Generated Tests and Code to Improve Code LMs

Learning Code Preference via Synthetic Evolution

CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement

CodeT: Code Generation with Generated Tests

Evaluating Language Models for Efficient Code Generation

DOCE: Finding the Sweet Spot for Execution-Based Code Generation

Execution-based Code Generation using Deep Reinforcement Learning

CODEP: Grammatical Seq2Seq Model for General-Purpose Code Generation

CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs

StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback

StepCoder: Improving Code Generation with Reinforcement Learning from Compiler Feedback

Preference Optimization for Reasoning with Pseudo Feedback

Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization

CodePAD: Sequence-based Code Generation with Pushdown Automaton

DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation

Iterative or Innovative? A Problem-Oriented Perspective for Code Optimization