PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, Noah Fiedel

DOI: https://doi.org/10.48550/arXiv.2204.02311

2022-04-05

Computation and Language

Abstract: Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call the Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis of bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and potential mitigation strategies.
