Abstract: Large pre-trained neural language models have brought immense progress to both NLP and software engineering. Models in OpenAI's GPT series now dwarf Google's BERT and Meta's RoBERTa, which previously set new benchmarks on a wide range of NLP applications. These models are trained on massive corpora of heterogeneous data from web crawls, which enables them to learn general language patterns and semantic relationships. However, the largest models are expensive to train and deploy and are often closed-source, so we lack access to their data and design decisions. We argue that this trend towards large, general-purpose models should be complemented with single-purpose, more modestly sized pre-trained models. In this work, we take StackOverflow (SO) as a domain example in which large volumes of rich, aligned code and text data are available. We adopt standard practices for pre-training large language models, including using a very large context size (2,048 tokens), batch size (0.5M tokens) and training set (27B tokens), coupled with a powerful toolkit (Megatron-LM), to train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at a budget of just $\$187$ and $\$800$, respectively. We compare the performance of our models with the previous SOTA model trained exclusively on SO data, as well as with general-purpose BERT models and OpenAI's ChatGPT, on four SO-specific downstream tasks: question quality prediction, closed question prediction, named entity recognition and obsoletion prediction (a new task we introduce). Not only do our models consistently outperform all baselines, but the smaller model is often sufficient for strong results. Both models are released to the public. These results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.

Octopus v4: Graph of language models

Octopus: On-device language model for function calling of software APIs

Octopus v2: On-device language model for super agent

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

ModelScope-Agent: Building Your Customizable Agent System with Open-source Large Language Models

Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent

Octopus: Embodied Vision-Language Programmer from Environmental Feedback

OMPGPT: A Generative Pre-trained Transformer Model for OpenMP

Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis

"Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow

Herd: Using multiple, smaller LLMs to match the performances of proprietary, large LLMs via an intelligent composer

Augmenting interpretable models with large language models during training

Large Language Models (LLMs): Deployment, Tokenomics and Sustainability

Battle of the Large Language Models: Dolly vs LLaMA vs Vicuna vs Guanaco vs Bard vs ChatGPT -- A Text-to-SQL Parsing Comparison

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

Large Language Models for Scholarly Ontology Generation: An Extensive Analysis in the Engineering Field

The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute

The Llama 3 Herd of Models

Are Small Language Models Ready to Compete with Large Language Models for Practical Applications?