From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference

Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, Vijay Gadepally
2023-10-05
Abstract: Large language models (LLMs) have exploded in popularity due to their new generative capabilities that go far beyond prior state-of-the-art. These technologies are increasingly being leveraged in various domains such as law, finance, and medicine. However, these models carry significant computational challenges, especially the compute and energy costs required for inference. Inference energy costs already receive less attention than the energy costs of training LLMs -- despite how often these large models are called on to conduct inference in reality (e.g., ChatGPT). As these state-of-the-art LLMs see increasing usage and deployment in various domains, a better understanding of their resource utilization is crucial for cost-savings, scaling performance, efficient hardware usage, and optimal inference strategies. In this paper, we describe experiments conducted to study the computational and energy utilization of inference with LLMs. We benchmark and conduct a preliminary analysis of the inference performance and inference energy costs of different sizes of LLaMA -- a recent state-of-the-art LLM -- developed by Meta AI on two generations of popular GPUs (NVIDIA V100 & A100) and two datasets (Alpaca and GSM8K) to reflect the diverse set of tasks/benchmarks for LLMs in research and practice. We present the results of multi-node, multi-GPU inference using model sharding across up to 32 GPUs. To our knowledge, our work is one of the first to study LLM inference performance from the perspective of computational and energy resources at this scale.
Computation and Language; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper examines the compute and energy costs of large language model (LLM) inference and quantifies them through experiments. Below is a summary of the core problems the paper attempts to address:

1. **Understanding the resource consumption of large language models**:
   - The paper focuses on the time and computational resources, and especially the energy, that large language models (such as those behind applications like ChatGPT) require during inference.
   - The research emphasizes the energy consumed while these models are actually in use, which has been studied less than the energy consumed in training.
2. **Evaluating the performance and energy consumption of LLMs of different scales**:
   - The case study is LLaMA, a recent large language model developed by Meta AI that is available at several parameter scales (7 billion, 13 billion, 33 billion, and 65 billion parameters).
   - The experiments focus on the largest version, the LLaMA 65B model, while also including smaller versions (7B and 13B) for benchmark comparison.
3. **Experimental setup and data analysis**:
   - Experiments were conducted on the MIT Supercloud high-performance computing system, equipped with various GPUs (including NVIDIA V100 and A100), to test the performance of models of different scales.
   - Two datasets (Alpaca and GSM8K) were used for evaluation, aiming to reflect the model's performance and resource usage across different types of tasks.
   - Detailed measurements were taken of inference performance (rates of words, tokens, and responses), latency, and energy consumption.
4. **Impact of model scale and hardware configuration**:
   - The performance and energy consumption of the different LLaMA model sizes were analyzed under the minimal hardware configuration each requires.
   - V100 and A100 GPUs were compared: for the smaller models (7B and 13B), the A100 showed better performance, but for the larger 65B model the improvement was not significant.
5. **Energy consumption analysis**:
   - The paper reports energy consumption for different model scales, batch sizes, and numbers of shards.
   - It analyzes how changes in maximum generation length affect energy consumption.
   - It explores how GPU power limits affect inference time and energy consumption, with results indicating that appropriate power limits can effectively reduce energy consumption.

In summary, through a series of experiments the paper characterizes the computational performance and energy consumption of LLM inference in practical settings, providing reference data and insights for future research.
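
To make the energy metrics in the summary concrete, here is a minimal Python sketch of how energy and energy per generated token could be derived from periodic GPU power readings. It is an illustration under stated assumptions, not the paper's measurement code: the function names, the evenly spaced sampling interval, and the sample values are all hypothetical, and the readings are assumed to be already collected and summed across the GPUs in use.

```python
from typing import Sequence


def energy_joules(power_watts: Sequence[float], interval_s: float) -> float:
    """Integrate evenly spaced power samples (watts) into energy (joules).

    Uses the trapezoidal rule; interval_s is the assumed time between samples.
    """
    if len(power_watts) < 2:
        return 0.0
    total = 0.0
    for a, b in zip(power_watts, power_watts[1:]):
        total += 0.5 * (a + b) * interval_s
    return total


def energy_per_token(
    power_watts: Sequence[float], interval_s: float, tokens_generated: int
) -> float:
    """Energy in joules divided by the number of tokens generated in that window."""
    if tokens_generated <= 0:
        raise ValueError("tokens_generated must be positive")
    return energy_joules(power_watts, interval_s) / tokens_generated


if __name__ == "__main__":
    # Hypothetical readings for illustration only; not data from the paper.
    samples = [250.0, 300.0, 300.0, 250.0]  # watts, sampled every 0.5 s
    print(energy_joules(samples, 0.5))         # joules over the 1.5 s window
    print(energy_per_token(samples, 0.5, 85))  # joules per generated token
```

Comparing this per-token figure across batch sizes, shard counts, or power caps is one way to see the trade-off the summary describes: a power limit can lengthen inference time while still lowering the total energy.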