Alec Radford, Christopher Hesse, Jeff Wu, Pranav Shyam, Gretchen Krueger, Sam McCandlish, T. Henighan, R. Child, S. Gray, Mark Chen, Amanda Askell, B. Chess, Prafulla Dhariwal, I. Sutskever, Eric Sigler, Arvind Neelakantan, Tom B. Brown, Ariel Herbert-Voss, Christopher Berner, Clemens Winter, A. Ramesh, Girish Sastry, Jack Clark, Benjamin Mann, Dario Amodei, Sandhini Agarwal, Nick Ryder, Daniel M. Ziegler, Melanie Subbiah, Mateusz Litwin, J. Kaplan

Abstract: Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
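
To make the few-shot setting in the abstract concrete, the following is a minimal sketch of how a task and its demonstrations might be specified purely as text, with no gradient updates. The function name `build_few_shot_prompt`, the `Input:`/`Output:` labels, and the example words are illustrative assumptions, not the format used by the authors.

```python
def build_few_shot_prompt(task_description, demonstrations, query):
    """Assemble a text-only prompt: a task description, K solved
    demonstrations, then the unsolved query for the model to complete.

    The "Input:"/"Output:" layout is an assumed convention for
    illustration only.
    """
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The query is left open so the model continues after "Output:".
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical word-unscrambling task with two demonstrations (K = 2).
prompt = build_few_shot_prompt(
    "Unscramble the letters to form an English word.",
    [("pplea", "apple"), ("nabaan", "banana")],
    "rgaep",
)
print(prompt)
```

The resulting string would be passed as-is to an autoregressive language model, whose continuation after the final `Output:` is read as the answer; changing the number of demonstrations changes only the prompt text, not the model's weights.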

Language Models are Few-shot Multilingual Learners

Few-shot Learning with Multilingual Language Models

Language Models are Few-Shot Learners

Multilingual Few-Shot Learning via Language Model Retrieval

Language Models as Few-Shot Learner for Task-Oriented Dialogue Systems

mGPT: Few-Shot Learners Go Multilingual

It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners

Language Models are Unsupervised Multitask Learners

True Few-Shot Learning with Language Models

GPT-3 Models are Poor Few-Shot Learners in the Biomedical Domain

Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

On the Multilingual Capabilities of Very Large-Scale English Language Models

Generating Training Data with Language Models: Towards Zero-Shot Language Understanding

Large Language Models are Few-Shot Clinical Information Extractors

Task Contamination: Language Models May Not Be Few-Shot Anymore

Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks

Analyzing and Adapting Large Language Models for Few-Shot Multilingual NLU: Are We There Yet?

Language Models for Text Classification: Is In-Context Learning Enough?

Calibrate Before Use: Improving Few-Shot Performance of Language Models

The unreasonable effectiveness of few-shot learning for machine translation

Language models are better than humans at next-token prediction