Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, Xiaowen Chu
2023-12-01
Abstract: Large Language Models (LLMs) have seen great advances in both academia and industry, and their popularity has resulted in numerous open-source frameworks and techniques for accelerating LLM pre-training, fine-tuning, and inference. Training and deploying LLMs is expensive because they require considerable computing resources and memory, hence many efficient approaches have been developed for improving system pipelines as well as operators. However, the runtime performance can vary significantly across hardware and software stacks, which makes it difficult to choose the best configuration. In this work, we aim to benchmark the performance from both macro and micro perspectives. First, we benchmark the end-to-end performance of pre-training, fine-tuning, and serving LLMs of different sizes, i.e., 7, 13, and 70 billion parameters (7B, 13B, and 70B), on three 8-GPU platforms with and without individual optimization techniques, including ZeRO, quantization, recomputation, and FlashAttention. Then, we dive deeper to provide a detailed runtime analysis of the sub-modules, including computing and communication operators in LLMs. For end users, our benchmark and findings help better understand different optimization techniques, training and inference frameworks, and hardware platforms when choosing configurations for deploying LLMs. For researchers, our in-depth module-wise analyses reveal potential opportunities for future work to further optimize the runtime performance of LLMs.
Performance, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the runtime performance of large language models (LLMs) during pre-training, fine-tuning, and inference serving. Specifically, it targets the following issues:
1. **Benchmarking and Configuration Selection**: Because LLMs are very large and demand substantial compute and memory, choosing the best hardware and software configuration is difficult. The paper therefore benchmarks LLMs of three sizes (7 billion, 13 billion, and 70 billion parameters) on three 8-GPU platforms (the high-performance A800 and the consumer-grade RTX4090 and RTX3090), with and without individual optimization techniques such as ZeRO, quantization, recomputation, and FlashAttention.
2. **Detailed Runtime Analysis of Submodules**: Beyond end-to-end performance, the paper analyzes the runtime of the individual submodules of LLMs, including computation and communication operators. This breakdown shows where each optimization technique takes effect and points to directions for future performance work.
3. **Impact of Optimization Techniques**: The paper evaluates ZeRO memory optimization, quantization, activation recomputation, and FlashAttention across the pre-training, fine-tuning, and serving stages, so that users can choose the methods that improve efficiency and reduce cost for their setup.
4. **End-to-End Performance Evaluation**: The paper evaluates pre-training frameworks (such as DeepSpeed and Megatron-LM), fine-tuning methods (such as LoRA and QLoRA), and inference serving frameworks (such as vLLM, LightLLM, and TGI) to show how these systems perform in practice.
5. **Hardware and Software Co-Optimization**: The paper examines overall system performance on different GPU servers (A800, RTX4090, and RTX3090) in combination with specific optimization techniques, and how those techniques can be used to make full use of the hardware.
In summary, the paper benchmarks and analyzes the runtime performance of LLMs at multiple levels, from macro (end-to-end) to micro (module-wise), to give researchers and end users practical guidance on configuring and optimizing LLMs.
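To make the configuration problem concrete, the sketch below estimates the memory needed just to hold model weights at the three model sizes from the abstract. It is a back-of-the-envelope illustration, not the paper's benchmark code: the names `BYTES_PER_PARAM`, `MODEL_SIZES`, `weight_memory_gb`, and `fits_on_one_gpu` are hypothetical, the bytes-per-parameter figures are the nominal storage sizes of each format, and the 24 GB capacity in the usage lines is an assumed illustrative value for a consumer GPU. Activations, gradients, optimizer state, and the KV cache are ignored, so real footprints are larger.

```python
# Weight-only memory estimate (assumption: nominal bytes per parameter).
# Ignores activations, gradients, optimizer state, and the KV cache.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Model sizes named in the abstract, in parameters.
MODEL_SIZES = {"7B": 7e9, "13B": 13e9, "70B": 70e9}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed to store the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits_on_one_gpu(num_params: float, dtype: str, gpu_memory_gb: float) -> bool:
    """True if the weights alone are smaller than one GPU's memory."""
    return weight_memory_gb(num_params, dtype) < gpu_memory_gb


if __name__ == "__main__":
    assumed_gpu_gb = 24.0  # assumption: illustrative consumer-GPU capacity
    for name, n in MODEL_SIZES.items():
        sizes = ", ".join(
            f"{d}: {weight_memory_gb(n, d):.1f} GB" for d in BYTES_PER_PARAM
        )
        fits = fits_on_one_gpu(n, "fp16", assumed_gpu_gb)
        print(f"{name} -> {sizes} | fp16 weights fit in {assumed_gpu_gb:.0f} GB: {fits}")
```

Under these assumptions, only the smallest model's half-precision weights fit on a single card of the assumed capacity, which is one reason techniques such as quantization and ZeRO-style sharding matter when choosing a configuration.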