CodeGen2: Lessons for Training LLMs on Programming and Natural Languages

Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, Yingbo Zhou
2023-07-12
Abstract: Large language models (LLMs) have demonstrated remarkable abilities in representation learning for program synthesis and understanding tasks. The quality of the learned representations appears to be dictated by the neural scaling laws as a function of the number of model parameters and observations, while imposing upper bounds on the model performance by the amount of available data and compute, which is costly. In this study, we attempt to render the training of LLMs for program synthesis more efficient by unifying four key components: (1) model architectures, (2) learning methods, (3) infill sampling, and (4) data distributions. Specifically, for the model architecture, we attempt to unify encoder and decoder-based models into a single prefix-LM. For learning methods, (i) causal language modeling, (ii) span corruption, and (iii) infilling are unified into a simple learning algorithm. For infill sampling, we explore the claim of a "free lunch" hypothesis. For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored. We conduct a comprehensive series of empirical experiments on 1B LLMs, for which failures and successes of this exploration are distilled into five lessons. We will provide a final recipe for training and release CodeGen2 models in size 1B, 3.7B, 7B, and 16B parameters, along with the training framework as open-source: https://github.com/salesforce/CodeGen.
Machine Learning
What problem does this paper attempt to address?
The paper asks how to train large language models (LLMs) for program synthesis more efficiently. Specifically, the authors try to reduce training cost and improve model performance by unifying four key components:

1. **Model architecture**: Unify encoder-based and decoder-based models into a single prefix language model (prefix-LM), hoping to simplify the architecture while maintaining or improving performance.
2. **Learning method**: Unify causal language modeling, span corruption, and infilling into one simple learning algorithm, to improve efficiency on zero-shot and understanding tasks.
3. **Infill sampling**: Examine the "free lunch" hypothesis, that is, the claim that a model can gain infill sampling in addition to left-to-right sampling without additional computational cost.
4. **Data distribution**: Study how a mixture of natural-language and programming-language data affects model performance, as well as the effect of multi-epoch training.

Through these attempts, the authors hope to arrive at a general-purpose model that performs well on a variety of synthesis and understanding tasks at a lower training cost. They also aim to support research and applications in the community by releasing the code and pre-trained models as open source.
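To make two of these ideas concrete, the following is a minimal sketch, not the authors' implementation. The first function builds the attention mask that distinguishes a prefix-LM from a purely causal decoder: prefix positions attend to each other bidirectionally, while later positions attend to the prefix and to earlier positions only. The second function rewrites a token sequence into an infilling training example by moving one span to the end. The function names and the sentinel strings (`<mask_1>`, `<sep>`, `<eom>`) are illustrative assumptions rather than a confirmed part of the released training framework.

```python
from typing import List


def prefix_lm_mask(seq_len: int, prefix_len: int) -> List[List[int]]:
    """Return mask[i][j] == 1 when position i may attend to position j.

    Hypothetical helper: positions inside the prefix see the whole prefix
    (bidirectional), all other positions see the prefix plus earlier tokens.
    """
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must be between 0 and seq_len")
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            in_prefix = j < prefix_len  # every position may read the prefix
            causal = j <= i             # and everything up to itself
            row.append(1 if (in_prefix or causal) else 0)
        mask.append(row)
    return mask


def to_infill_example(
    tokens: List[str],
    start: int,
    end: int,
    mask: str = "<mask_1>",  # assumed sentinel names, for illustration only
    sep: str = "<sep>",
    eom: str = "<eom>",
) -> List[str]:
    """Move tokens[start:end] to the end so a left-to-right model can infill it."""
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must be non-empty and inside the sequence")
    span = tokens[start:end]
    return tokens[:start] + [mask] + tokens[end:] + [sep, mask] + span + [eom]


if __name__ == "__main__":
    for row in prefix_lm_mask(seq_len=4, prefix_len=2):
        print(row)
    print(to_infill_example(["def", "f", "(", ")", ":"], start=1, end=2))
```

With `prefix_len=0` the mask reduces to the ordinary lower-triangular causal mask, which is one way to see why a prefix-LM can be viewed as a generalization of a causal decoder. The infilling transform leaves the training objective as next-token prediction; only the order of the data changes.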