Abstract: Recent developments in unsupervised representation learning have successfully established the concept of transfer learning in NLP. Three forces are mainly driving the improvements in this area of research. First, more elaborate architectures are making better use of contextual information: rather than simply plugging in static pre-trained representations, models now learn representations from the surrounding context in end-to-end trainable models with more intelligently designed language modelling objectives. Second, larger corpora are used as resources for pre-training large language models in a self-supervised fashion; these models are afterwards fine-tuned on supervised tasks. Third, advances in parallel computing as well as in cloud computing have made it possible to train these models with growing capacities in the same or even in shorter time than previously established models. These three developments culminate in new state-of-the-art (SOTA) results being reported at an ever higher frequency. It is not always obvious where these improvements originate, as it is not possible to completely disentangle the contributions of the three driving forces. We set out to provide a clear and concise overview of several large pre-trained language models that achieved SOTA results in the last two years, with respect to their use of new architectures and resources. We want to clarify for the reader where the models differ, and we furthermore attempt to gain some insight into the individual contributions of lexical/computational improvements as well as of architectural changes. We explicitly do not intend to quantify these contributions, but rather see our work as an overview that identifies potential starting points for benchmark comparisons. Furthermore, we tentatively point out possibilities for improvement in the field of open-sourcing and reproducible research.

Algorithmic progress in language models

Efficiency optimization of large-scale language models based on deep learning in natural language processing tasks

No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Language models scale reliably with over-training and on downstream tasks

Mind the Gap: Assessing Temporal Generalization in Neural Language Models

The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute

On the comparability of Pre-trained Language Models

On the State of the Art of Evaluation in Neural Language Models

PaLM: Scaling Language Modeling with Pathways

A Survey of Large Language Models

Evaluating Computational Language Models with Scaling Properties of Natural Language

On the Predictive Power of Neural Language Models for Human Real-Time Comprehension Behavior

The Efficiency Spectrum of Large Language Models: An Algorithmic Survey

Training Compute-Optimal Large Language Models

Deconstructing What Makes a Good Optimizer for Language Models

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training

Revisiting Neural Scaling Laws in Language and Vision

Evolution of Natural Language Processing Technology: Not Just Language Processing Towards General Purpose AI

Simple and Scalable Strategies to Continually Pre-train Large Language Models