Abstract: Large Language Models (LLMs) have seen great advances in both academia and industry, and their popularity has resulted in numerous open-source frameworks and techniques for accelerating LLM pre-training, fine-tuning, and inference. Training and deploying LLMs is expensive because it requires considerable computing resources and memory; hence, many efficient approaches have been developed to improve system pipelines as well as operators. However, runtime performance can vary significantly across hardware and software stacks, which makes it difficult to choose the best configuration. In this work, we aim to benchmark the performance from both macro and micro perspectives. First, we benchmark the end-to-end performance of pre-training, fine-tuning, and serving LLMs of different sizes, i.e., 7, 13, and 70 billion parameters (7B, 13B, and 70B), on three 8-GPU platforms with and without individual optimization techniques, including ZeRO, quantization, recomputation, and FlashAttention. Then, we dive deeper to provide a detailed runtime analysis of the sub-modules, including the computing and communication operators in LLMs. For end users, our benchmark and findings help them better understand different optimization techniques, training and inference frameworks, and hardware platforms when choosing configurations for deploying LLMs. For researchers, our in-depth module-wise analyses reveal potential opportunities for future work to further optimize the runtime performance of LLMs.

What problem does this paper attempt to address?

The paper focuses on the runtime performance of large language models (LLMs) during pre-training, fine-tuning, and inference serving. Specifically, it addresses the following key issues:

1. **Benchmarking and Configuration Selection**: Because LLMs are very large and demand substantial computational resources, selecting the best configuration across different hardware and software stacks is complex. The paper therefore benchmarks LLMs of different sizes (7 billion, 13 billion, and 70 billion parameters) on three different 8-GPU platforms (the data-center A800 and the consumer-grade RTX4090 and RTX3090). It also compares performance with various optimization techniques (such as ZeRO, quantization, recomputation, and FlashAttention) enabled and disabled.
2. **Detailed Runtime Analysis of Submodules**: Beyond end-to-end performance, the paper examines the individual submodules of LLMs, including computation and communication operators, and provides a detailed runtime analysis of each. This analysis helps explain the effects of different optimization techniques and identifies potential directions for future performance optimization.
3. **Impact of Optimization Techniques**: The paper evaluates a range of optimization techniques (such as ZeRO memory optimization, quantization, activation recomputation, and FlashAttention) in the pre-training, fine-tuning, and serving stages, to help users choose appropriate optimization methods to improve efficiency and reduce costs.
4. **End-to-End Performance Evaluation**: The paper evaluates pre-training frameworks (such as DeepSpeed and Megatron-LM), fine-tuning methods (such as LoRA and QLoRA), and inference serving frameworks (such as vLLM, LightLLM, and TGI), so that users can see how these systems perform in practical applications.
5. **Hardware and Software Co-Optimization**: The paper explores overall system performance on different types of GPU servers (A800, RTX4090, and RTX3090) when combined with specific optimization techniques, and how these techniques can be used to fully utilize hardware resources.

In summary, the goal of this paper is to benchmark and analyze the runtime performance of LLMs at multiple levels, from macro to micro, to give researchers and end users insight into how to configure and optimize LLMs effectively.
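To give a rough sense of why memory drives configuration choices for the 7B, 13B, and 70B model sizes, the following minimal sketch estimates the memory needed for model weights alone at several precisions. It is an illustration, not a calculation from the paper: the parameter counts are the nominal sizes (real checkpoints differ slightly), the bytes-per-parameter values are the standard widths of each format, and the function names are hypothetical. Activations, optimizer states, gradients, and the KV cache are deliberately excluded, so actual requirements are higher.

```python
# Rough weight-only memory estimate for nominal LLM sizes.
# Assumption: standard storage widths per parameter; these are not figures from the paper.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Assumption: nominal parameter counts; real checkpoints differ slightly.
MODEL_SIZES = {"7B": 7e9, "13B": 13e9, "70B": 70e9}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the memory (in GiB) needed to store the weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def footprint_table() -> dict:
    """Build a {model: {dtype: GiB}} table, rounded to one decimal place."""
    return {
        name: {
            dtype: round(weight_memory_gib(n, dtype), 1)
            for dtype in BYTES_PER_PARAM
        }
        for name, n in MODEL_SIZES.items()
    }


if __name__ == "__main__":
    for name, row in footprint_table().items():
        print(name, row)
```

Under these assumptions, a nominal 70B model in fp16 needs roughly 130 GiB for weights alone, which is more than a single 24 GB consumer GPU can hold and is one reason techniques such as quantization and ZeRO-style partitioning matter on those platforms.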

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Inference Performance Optimization for Large Language Models on CPUs

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Distributed Inference Performance Optimization for LLMs on CPUs

Evaluation of pre-training large language models on leadership-class supercomputers

Efficient and Economic Large Language Model Inference with Attention Offloading

Achieving Peak Performance for Large Language Models: A Systematic Review

LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators

Search for Efficient Large Language Models

LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

Optimizing Distributed Training on Frontier for Large Language Models

Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches

Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective

From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference

A Hardware Evaluation Framework for Large Language Model Inference

GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments

All Language Models Large and Small

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache