UniCoder: Scaling Code Large Language Model via Universal Code

Tao Sun, Linzheng Chai, Jian Yang, Yuwei Yin, Hongcheng Guo, Jiaheng Liu, Bing Wang, Liqun Yang, Zhoujun Li

2024-06-24

Abstract: Intermediate reasoning or acting steps have successfully improved large language models (LLMs) for handling various downstream natural language processing (NLP) tasks. When applying LLMs for code generation, recent works mainly focus on directing the models to articulate intermediate natural-language reasoning steps, as in chain-of-thought (CoT) prompting, and then output code with the natural language or other structured intermediate steps. However, such output is not suitable for code translation or generation tasks, since standard CoT differs from code in its logical structure and form of expression. In this work, we introduce the universal code (UniCode) as the intermediate representation. It is a description of algorithm steps using a mix of conventions of programming languages, such as the assignment operator, conditional operator, and loop. Hence, we collect an instruction dataset UniCoder-Instruct to train our model UniCoder on multi-task learning objectives. UniCoder-Instruct comprises natural-language questions, code solutions, and the corresponding universal code. The alignment between the intermediate universal code representation and the final code solution significantly improves the quality of the generated code. The experimental results demonstrate that UniCoder with the universal code outperforms the previous prompting methods by a large margin, showcasing the effectiveness of the structural clues in pseudo-code.

Computation and Language

What problem does this paper attempt to address?

The paper aims to improve the performance of large language models (LLMs) on code generation, particularly across multiple programming languages. Current methods rely heavily on chain-of-thought (CoT) prompting, in which the model describes intermediate steps in natural language before producing code. This approach is a poor fit for code translation and generation because standard CoT does not match the logical structure or the form of expression of code. The paper proposes "universal code" (UniCode) as an intermediate representation: a description of algorithm steps written with programming-language conventions such as assignments, conditions, and loops.

The authors collect an instruction dataset, UniCoder-Instruct, and use it to train the UniCoder model. The model is fine-tuned with multi-task learning objectives: zero-shot question-answer generation, question-to-universal-code generation, universal-code-to-answer translation, and Universal-code-of-Thought (UoT). Experimental results show that UniCoder outperforms previous methods on the Python benchmarks HumanEval and MBPP and on the multilingual benchmark MultiPL-E. Ablation studies support the effectiveness of the approach, and further discussion examines its impact. The main contributions are the definition of a language-independent universal code, the UniCoder-Instruct dataset, and the UniCoder model trained with multi-task learning on universal code.
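The four training objectives can be pictured as four ways of pairing the fields of one (question, universal code, answer) triple. The sketch below is only an illustration of that idea, not the authors' data pipeline: the `Triple` class, the `build_samples` function, the objective names, the example triple, and the pseudo-code notation used for the universal code are all hypothetical. It also assumes that the UoT target is the universal code followed by the final code.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Triple:
    """One instruction record: question, universal code, code answer (hypothetical schema)."""
    question: str
    universal_code: str
    answer: str


def build_samples(t: Triple) -> Dict[str, Tuple[str, str]]:
    """Return one (source, target) pair per objective.

    The objective names are illustrative. The UoT target assumes the model
    writes the universal code first and the final code second.
    """
    return {
        "question_to_answer": (t.question, t.answer),
        "question_to_universal_code": (t.question, t.universal_code),
        "universal_code_to_answer": (t.universal_code, t.answer),
        "uot": (t.question, t.universal_code + "\n" + t.answer),
    }


# A made-up example triple; the universal-code notation is an assumption.
EXAMPLE = Triple(
    question="Return the sum of the even numbers in a list.",
    universal_code=(
        "total <- 0\n"
        "for each x in nums:\n"
        "    if x mod 2 == 0:\n"
        "        total <- total + x\n"
        "return total"
    ),
    answer=(
        "def sum_even(nums):\n"
        "    return sum(x for x in nums if x % 2 == 0)"
    ),
)

if __name__ == "__main__":
    for name, (source, target) in build_samples(EXAMPLE).items():
        print(f"[{name}]\nSOURCE:\n{source}\nTARGET:\n{target}\n")
```

Keeping the three fields separate and deriving the pairs on the fly means a single record can feed every objective, which is one plausible reading of "multi-task learning" here.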

UniCode: Learning a Unified Codebook for Multimodal Large Language Models

Improving Natural Language Capability of Code Large Language Model

CodeT5+: Open Code Large Language Models for Code Understanding and Generation

CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules

AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data

MoTCoder: Elevating Large Language Models with Modular of Thought for Challenging Programming Tasks

Exploring and Unleashing the Power of Large Language Models in Automated Code Translation

Large Language Models as Code Executors: An Exploratory Study

StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback

Chain of Code: Reasoning with a Language Model-Augmented Code Emulator

BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models

Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for Code Generation

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

A Pair Programming Framework for Code Generation Via Multi-Plan Exploration and Feedback-Driven Refinement

CodeGen2: Lessons for Training LLMs on Programming and Natural Languages

Multilingual Code Co-Evolution Using Large Language Models