CodeGen2: Lessons for Training LLMs on Programming and Natural Languages

Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, Yingbo Zhou

2023-07-12

Abstract: Large language models (LLMs) have demonstrated remarkable abilities in representation learning for program synthesis and understanding tasks. The quality of the learned representations appears to be dictated by the neural scaling laws as a function of the number of model parameters and observations, while imposing upper bounds on the model performance by the amount of available data and compute, which is costly. In this study, we attempt to render the training of LLMs for program synthesis more efficient by unifying four key components: (1) model architectures, (2) learning methods, (3) infill sampling, and (4) data distributions. Specifically, for the model architecture, we attempt to unify encoder and decoder-based models into a single prefix-LM. For learning methods, (i) causal language modeling, (ii) span corruption, and (iii) infilling are unified into a simple learning algorithm. For infill sampling, we explore the claim of a "free lunch" hypothesis. For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored. We conduct a comprehensive series of empirical experiments on 1B LLMs, for which failures and successes of this exploration are distilled into five lessons. We will provide a final recipe for training and release CodeGen2 models in size 1B, 3.7B, 7B, and 16B parameters, along with the training framework as open-source: https://github.com/salesforce/CodeGen.
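As a rough illustration of the prefix-LM idea mentioned in the abstract, the sketch below builds an attention mask in which the prefix is attended bidirectionally and all later positions are attended causally. The function name `prefix_lm_mask` and its parameters are hypothetical and are not taken from the CodeGen2 code base; this shows the general technique only, not the authors' implementation.

```python
def prefix_lm_mask(seq_len: int, prefix_len: int) -> list[list[bool]]:
    """Return mask[i][j], True when position i may attend to position j.

    Positions inside the prefix see the whole prefix (encoder-like).
    Positions after the prefix see the prefix plus every earlier or
    equal position (decoder-like, causal).
    """
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must lie in [0, seq_len]")
    return [
        [j < prefix_len or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Print a small mask: 1 = may attend, 0 = masked out.
    for row in prefix_lm_mask(seq_len=5, prefix_len=2):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

With `prefix_len=0` the mask reduces to the ordinary causal (lower-triangular) mask, and with `prefix_len=seq_len` it becomes fully bidirectional, which is one way to see how a single architecture can cover both regimes.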

Machine Learning

What problem does this paper attempt to address?

This paper addresses how to train large language models (LLMs) for program synthesis more efficiently. Specifically, the authors try to reduce training costs and improve model performance by unifying four key components:

1. **Model architecture**: Unify encoder and decoder-based models into a single prefix language model (prefix-LM), with the aim of simplifying the architecture while maintaining or improving performance.
2. **Learning method**: Unify causal language modeling, span corruption, and infilling into a simple learning algorithm, so that one model can serve both zero-shot synthesis and understanding tasks.
3. **Infill sampling**: Explore the "free lunch" hypothesis, that is, whether a model can gain infill sampling in addition to left-to-right sampling without extra computational cost.
4. **Data distribution**: Study how a mixture distribution of natural language and programming languages affects model performance, as well as the effect of multi-epoch training.

Through these attempts, the authors aim to provide a general-purpose model that performs well on a variety of synthesis and understanding tasks while reducing training costs. They also aim to support research and applications in the community by releasing open-source code and pre-trained models.
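To make the unified learning method more concrete, here is a minimal sketch of how a single training sequence could be formatted either for causal language modeling or for span corruption, so that a left-to-right decoder can also learn infilling. The sentinel names (`<mask_1>`, `<sep>`, `<eom>`), the helper names, and the mixing probability `p_causal` are assumptions made for illustration; the source does not specify them, and this is not presented as the authors' exact recipe.

```python
import random


def corrupt_span(tokens, start, end,
                 mask="<mask_1>", sep="<sep>", eom="<eom>"):
    """Move tokens[start:end] to the end of the sequence.

    The span is replaced in place by a mask sentinel, then appended
    after a separator so it can be predicted left to right.
    Sentinel strings are hypothetical placeholders.
    """
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must satisfy 0 <= start < end <= len(tokens)")
    corrupted = tokens[:start] + [mask] + tokens[end:]
    target = [mask] + tokens[start:end] + [eom]
    return corrupted + [sep] + target


def make_example(tokens, rng, p_causal=0.5):
    """Pick one objective per sequence (p_causal is an assumed value)."""
    if rng.random() < p_causal:
        return list(tokens)  # plain causal LM: sequence left unchanged
    start = rng.randrange(0, len(tokens))
    end = rng.randrange(start + 1, len(tokens) + 1)
    return corrupt_span(tokens, start, end)


if __name__ == "__main__":
    code = ["def", "add", "(", "a", ",", "b", ")", ":",
            "return", "a", "+", "b"]
    print(corrupt_span(code, 8, 12))
    print(make_example(code, random.Random(0)))
```

Because both objectives produce an ordinary token sequence, the same next-token loss can be applied to either, which is the sense in which the objectives are unified in this sketch.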

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models

Multi-Programming Language Ensemble for Code Generation in Large Language Model

Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models

Large Language Models in Computer Science Education: A Systematic Literature Review

Meta Large Language Model Compiler: Foundation Models of Compiler Optimization

Improving Code Generation by Training with Natural Language Feedback

From Code to Play: Benchmarking Program Search for Games Using Large Language Models

Large Language Models Meet NL2Code: A Survey

A Survey on Large Language Models for Code Generation

Large Language Model-Aware In-Context Learning for Code Generation

Automatically Generating CS Learning Materials with Large Language Models

Evolving Code with A Large Language Model

If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

Improving Natural Language Capability of Code Large Language Model

VeriGen: A Large Language Model for Verilog Code Generation