Abstract: Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications, while in the latter the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Second, they often employ a limited set of pretraining objectives that might not be relevant to some downstream tasks and hence result in substantial performance degradation. To address these limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives, which mitigates the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs, rather than training from scratch, to efficiently scale up our models, and we explore instruction-tuning to align the models with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval. In particular, our instruction-tuned CodeT5+ 16B achieves new SoTA results on the HumanEval code generation task against other open code LLMs.

CodeT5+: Open Code Large Language Models for Code Understanding and Generation

OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

WizardLM: Empowering Large Language Models to Follow Complex Instructions

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

CodeGen2: Lessons for Training LLMs on Programming and Natural Languages

Large Language Models as Code Executors: An Exploratory Study

CodeChain: Towards Modular Code Generation Through Chain of Self-revisions with Representative Sub-modules

MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

AI-assisted Code Authoring at Scale: Fine-tuning, deploying, and mixed methods evaluation

Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models

CodeACT: Code Adaptive Compute-efficient Tuning Framework for Code LLMs