"Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow

Manisha Mukherjee, Vincent J. Hellendoorn
2024-01-24
Abstract: Large pre-trained neural language models have brought immense progress to both NLP and software engineering. Models in OpenAI's GPT series now dwarf Google's BERT and Meta's RoBERTa, which previously set new benchmarks on a wide range of NLP applications. These models are trained on massive corpora of heterogeneous data from web crawls, which enables them to learn general language patterns and semantic relationships. However, the largest models are both expensive to train and deploy and are often closed-source, so we lack access to their data and design decisions. We argue that this trend towards large, general-purpose models should be complemented with single-purpose, more modestly sized pre-trained models. In this work, we take StackOverflow (SO) as a domain example in which large volumes of rich aligned code and text data are available. We adopt standard practices for pre-training large language models, including using a very large context size (2,048 tokens), batch size (0.5M tokens) and training set (27B tokens), coupled with a powerful toolkit (Megatron-LM), to train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at a budget of just $\$187$ and $\$800$, respectively. We compare the performance of our models with both the previous SOTA model trained on SO data exclusively, as well as general-purpose BERT models and OpenAI's ChatGPT, on four SO-specific downstream tasks: question quality prediction, closed question prediction, named entity recognition and obsoletion prediction (a new task we introduce). Not only do our models consistently outperform all baselines, but the smaller model is often sufficient for strong results. Both models are released to the public. These results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
Computation and Language, Software Engineering
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses three issues in applying large language models (LLMs) to software engineering:

1. **Cost and Resource Constraints**: Large pre-trained models such as GPT-3 are powerful but expensive to train and deploy, and they are often closed-source. This puts them out of reach for much of academia and for small research teams.
2. **Generality vs. Task-Specific Performance**: General-purpose language models may perform poorly on domain-specific tasks because they lack domain-specific data and training signals. For example, on code-related StackOverflow (SO) tasks, general models may not match models trained specifically on SO data.
3. **Model Size vs. Performance**: Large models perform well on many tasks, but can smaller models achieve similar or even better performance on specific tasks? If so, the cost of using such models drops significantly.

### Main Contributions of the Paper

To address these issues, the paper makes the following contributions:

1. **Model Training**: The authors trained two models based on the BERT architecture and pre-trained on StackOverflow data: SOBertBase (109 million parameters) and SOBertLarge (762 million parameters). Training followed standard large-model practice (a 2,048-token context, 0.5M-token batches, a 27B-token training set, and the Megatron-LM toolkit) yet remained inexpensive, at $187 and $800 respectively.
2. **Downstream Task Evaluation**: The authors evaluated both models on four StackOverflow-specific downstream tasks: question quality prediction, closed question prediction, named entity recognition, and obsoletion prediction (a newly introduced task). The SOBert models outperform all baselines on these tasks, and the smaller SOBertBase model is often sufficient for strong results.
3. **Data Processing and Model Design**: The authors detail their data processing and model design choices, including the use of the SentencePiece tokenizer, retaining code blocks, and a maximum sequence length of 2,048 tokens. These choices help the model handle the mix of code and text found on StackOverflow.

### Conclusion

The paper demonstrates experimentally that, even on a limited budget, appropriate pre-training on domain-specific data can yield powerful small to medium-sized language models. These models perform well on domain-specific tasks while remaining cost-effective and easy to deploy, which makes them a viable option for academia and small research teams.
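To give a feel for the training scale quoted in the abstract (a 2,048-token context, 0.5M-token batches, 27B training tokens), the following is a rough back-of-the-envelope sketch, not the authors' configuration. It assumes that "0.5M" and "27B" mean exactly 500,000 and 27,000,000,000 tokens, that every sequence is packed to the full context length, and that the data is seen once; the function names are illustrative only.

```python
# Back-of-the-envelope training-scale arithmetic using figures from the abstract.
# Assumptions (not from the paper): exact round numbers, fully packed sequences,
# and a single pass over the training set.

CONTEXT_TOKENS = 2_048          # context size stated in the abstract
BATCH_TOKENS = 500_000          # "0.5M tokens", read as exactly 500,000
TRAIN_TOKENS = 27_000_000_000   # "27B tokens", read as exactly 27e9


def sequences_per_batch(batch_tokens: int, context_tokens: int) -> int:
    """Number of full-length sequences that fit in one batch."""
    return batch_tokens // context_tokens


def steps_per_pass(train_tokens: int, batch_tokens: int) -> int:
    """Optimizer steps needed to see every training token once (rounded up)."""
    return -(-train_tokens // batch_tokens)  # ceiling division


if __name__ == "__main__":
    print(sequences_per_batch(BATCH_TOKENS, CONTEXT_TOKENS))  # 244
    print(steps_per_pass(TRAIN_TOKENS, BATCH_TOKENS))         # 54000
```

Under these assumptions, each batch holds about 244 full-length sequences, and one pass over the data takes 54,000 optimizer steps. The actual step count depends on how many passes the authors ran and how sequences were packed, neither of which is stated here.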