Abstract: We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

What problem does this paper attempt to address?

The main problem this paper attempts to address is how to build a set of efficient, open foundation language models that achieve the best possible performance at various inference budgets. Specifically, the goals of the paper include:

1. **Model Performance Optimization**: By training on more tokens than is typical, the models aim to match or exceed existing top models on a range of benchmarks. For example, LLaMA-13B surpasses GPT-3 (175B parameters) on most benchmarks, while LLaMA-65B is competitive with top models such as Chinchilla-70B and PaLM-540B.

2. **Dataset Openness**: Unlike many models that rely on proprietary datasets, LLaMA is trained only on publicly available data, which makes the models compatible with open-sourcing and accessible to the research community.

3. **Inference Efficiency**: Given a target level of performance, preferring the model that is fastest at inference rather than the one that is fastest to train. Although a larger model may be cheaper to train to that level, a smaller model trained for longer is more economical at inference. For example, while Hoffmann et al. recommend training a 10B-parameter model on 200B tokens, the authors find that a 7B-parameter model continues to improve even after 1T tokens.

4. **Multi-task Understanding and Generation Capability**: Evaluating the models on a range of tasks, including common sense reasoning, closed-book question answering, reading comprehension, mathematical reasoning, and code generation. The results show that LLaMA performs strongly across these tasks in both zero-shot and few-shot settings.

5. **Instruction Fine-tuning**: Improving performance on massive multitask language understanding (MMLU) through a brief round of instruction fine-tuning. Even without fine-tuning, LLaMA-65B can follow basic instructions, and fine-tuning improves its performance further.

6. **Bias, Toxicity, and Misinformation Assessment**: Evaluating whether generated content contains bias, toxicity, or misinformation, in order to understand the models' safety in practical applications. Across multiple standard benchmarks, the authors find that LLaMA compares favorably with other models in some respects but still needs improvement in others.

In summary, this paper aims to provide efficient, open, and high-performing foundation language models through the LLaMA model series, in order to advance research and applications in natural language processing.
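The training-versus-inference trade-off behind the 10B/200B and 7B/1T comparison can be made concrete with some rough arithmetic. The sketch below is illustrative only: it assumes the common rule of thumb that training costs about 6 FLOPs per parameter per token and inference about 2 FLOPs per parameter per generated token, and it assumes, purely for illustration, that both configurations reach the same performance target. Neither assumption is taken from the summary above, and all function names are hypothetical.

```python
# Rough compute arithmetic for the two configurations mentioned in the text.
# ASSUMPTION (not from the source): training FLOPs ~ 6 * params * tokens and
# inference FLOPs ~ 2 * params per generated token.

def training_flops(params, tokens):
    """Approximate total training compute in FLOPs."""
    return 6.0 * params * tokens


def inference_flops_per_token(params):
    """Approximate compute to generate one token, in FLOPs."""
    return 2.0 * params


def break_even_tokens(small, large):
    """Number of served tokens after which the smaller, longer-trained model
    has repaid its extra training compute through cheaper inference.
    ASSUMPTION: both models reach the same quality, which is hypothetical.
    Each argument is a (params, training_tokens) pair."""
    extra_training = training_flops(*small) - training_flops(*large)
    saving_per_token = (inference_flops_per_token(large[0])
                        - inference_flops_per_token(small[0]))
    return extra_training / saving_per_token


# (parameters, training tokens) for the two configurations in the text.
LARGE = (10e9, 200e9)   # 10B parameters, 200B tokens
SMALL = (7e9, 1e12)     # 7B parameters, 1T tokens

if __name__ == "__main__":
    for name, cfg in (("10B / 200B tokens", LARGE), ("7B / 1T tokens", SMALL)):
        print(name,
              "train FLOPs: %.2e" % training_flops(*cfg),
              "inference FLOPs/token: %.2e" % inference_flops_per_token(cfg[0]),
              "tokens per parameter: %.1f" % (cfg[1] / cfg[0]))
    print("break-even served tokens: %.2e" % break_even_tokens(SMALL, LARGE))
```

Under these assumptions, the 7B run needs about 3.5 times the training compute of the 10B run but only about 0.7 times the compute per generated token, so the extra training cost is repaid once enough tokens have been served. The break-even figure depends entirely on the equal-quality assumption and should be read as an illustration of the argument, not as a result from the paper.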

LLaMA: Open and Efficient Foundation Language Models

Code Llama: Open Foundation Models for Code

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Me LLaMA: Foundation Large Language Models for Medical Applications

The Llama 3 Herd of Models

LLM360: Towards Fully Transparent Open-Source LLMs

Meta Large Language Model Compiler: Foundation Models of Compiler Optimization

Llemma: An Open Language Model For Mathematics

LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

TinyLlama: An Open-Source Small Language Model

LLaMA Pro: Progressive LLaMA with Block Expansion

MaLA-500: Massive Language Adaptation of Large Language Models

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

Large Language Models: A Survey

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

GLM-130B: An Open Bilingual Pre-trained Model

Gorilla: Large Language Model Connected with Massive APIs

MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications

Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation

xLAM: A Family of Large Action Models to Empower AI Agent Systems