Abstract: Large language models (LLMs) have exploded in popularity due to their new generative capabilities that go far beyond the prior state of the art. These technologies are increasingly being leveraged in various domains such as law, finance, and medicine. However, these models carry significant computational challenges, especially the compute and energy costs required for inference. Inference energy costs already receive less attention than the energy costs of training LLMs, despite how often these large models are called on to conduct inference in reality (e.g., ChatGPT). As these state-of-the-art LLMs see increasing usage and deployment in various domains, a better understanding of their resource utilization is crucial for cost savings, scaling performance, efficient hardware usage, and optimal inference strategies. In this paper, we describe experiments conducted to study the computational and energy utilization of inference with LLMs. We benchmark and conduct a preliminary analysis of the inference performance and inference energy costs of different sizes of LLaMA, a recent state-of-the-art LLM developed by Meta AI, on two generations of popular GPUs (NVIDIA V100 and A100) and two datasets (Alpaca and GSM8K) to reflect the diverse set of tasks/benchmarks for LLMs in research and practice. We present the results of multi-node, multi-GPU inference using model sharding across up to 32 GPUs. To our knowledge, our work is one of the first to study LLM inference performance from the perspective of computational and energy resources at this scale.

What problem does this paper attempt to address?

The paper explores the computational and energy costs of large language models (LLMs) during inference and quantifies them through experiments. The core problems it attempts to address are summarized below.

1. **Understanding the resource consumption of large language models**:
   - The paper focuses on the time and computational resources, especially energy, that large language models (such as those behind applications like ChatGPT) require during inference.
   - It emphasizes the energy consumed when these models are actually used, which has been studied less than the energy consumed during training.
2. **Evaluating the performance and energy consumption of LLMs of different scales**:
   - The paper uses LLaMA, a large language model developed by Meta AI, as a case study. LLaMA is available at several parameter scales (7 billion, 13 billion, 33 billion, and 65 billion parameters).
   - The experiments focus on the largest version, LLaMA 65B, and include the smaller 7B and 13B versions for benchmark comparison.
3. **Experimental setup and data analysis**:
   - Experiments were conducted on the MIT Supercloud high-performance computing system, which is equipped with several GPU types (including NVIDIA V100 and A100), to test the performance of models of different scales.
   - Two datasets (Alpaca and GSM8K) were used for evaluation, to reflect performance and resource usage across different types of tasks.
   - Detailed measurements were taken of inference performance (the rates at which words, tokens, and responses are generated), latency, and energy consumption.
4. **Impact of model scale and hardware configuration**:
   - The performance and energy consumption of the different LLaMA sizes were compared under the minimal hardware configuration each size requires.
   - V100 and A100 GPUs were compared. For the smaller models (7B and 13B) the A100 performed better, but for the larger 65B model the improvement was not significant.
5. **Energy consumption analysis**:
   - The paper reports energy consumption across model scales, batch sizes, and shard counts.
   - It analyzes how changes in maximum generation length affect energy consumption.
   - It explores how GPU power limits affect inference time and energy consumption; the results indicate that appropriate power limits can effectively reduce energy consumption.

In summary, through a series of experiments the paper characterizes the computational performance and energy consumption of large language models in practical use, providing reference data and insights for future research.
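To make the measured quantities concrete, the following is a minimal sketch of how sampled GPU power readings could be turned into energy and throughput figures such as energy per token and tokens per second. It is not the paper's measurement code: the function names, the `InferenceMetrics` container, the trapezoidal integration, and the sample numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (watts) over time (seconds) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Area of the trapezoid between two consecutive samples.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


@dataclass
class InferenceMetrics:
    """Hypothetical summary of one inference run (names are illustrative)."""
    total_energy_j: float
    tokens_per_second: float
    joules_per_token: float
    joules_per_response: float


def summarize_run(
    timestamps_s: Sequence[float],
    per_gpu_power_w: Sequence[Sequence[float]],
    tokens_generated: int,
    responses: int,
) -> InferenceMetrics:
    """Combine per-GPU power traces into run-level energy and throughput metrics."""
    if tokens_generated <= 0 or responses <= 0:
        raise ValueError("tokens_generated and responses must be positive")
    duration_s = timestamps_s[-1] - timestamps_s[0]
    if duration_s <= 0:
        raise ValueError("the run must span a positive duration")
    # A sharded model draws power on every GPU it occupies, so sum over shards.
    total_energy = sum(energy_joules(timestamps_s, trace) for trace in per_gpu_power_w)
    return InferenceMetrics(
        total_energy_j=total_energy,
        tokens_per_second=tokens_generated / duration_s,
        joules_per_token=total_energy / tokens_generated,
        joules_per_response=total_energy / responses,
    )


# Made-up example: two GPUs sampled once per second for four seconds.
timestamps = [0.0, 1.0, 2.0, 3.0, 4.0]
gpu_power = [
    [250.0, 250.0, 250.0, 250.0, 250.0],  # GPU 0, constant 250 W
    [200.0, 200.0, 200.0, 200.0, 200.0],  # GPU 1, constant 200 W
]
metrics = summarize_run(timestamps, gpu_power, tokens_generated=900, responses=9)
print(metrics)
```

Under these assumptions, comparing two runs that differ only in one setting (for example a GPU power limit, batch size, or shard count) reduces to comparing their `joules_per_token` and `tokens_per_second` values, which is the kind of trade-off between inference time and energy that the summary describes.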

From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference
