Abstract: The recent surge of open-source large language models (LLMs) enables developers to create AI-based solutions while maintaining control over aspects such as privacy and compliance, thereby providing governance and ownership of the model deployment process. To utilize these LLMs, inference engines are needed. These engines load the model's weights onto available resources, such as GPUs, and process queries to generate responses. The inference speed, or performance, of an LLM is critical for real-time applications, as each inference requires millions or billions of floating-point operations. Recently, advanced inference engines such as vLLM have emerged, incorporating novel mechanisms such as efficient memory management to achieve state-of-the-art performance. In this paper, we analyze the performance, particularly the throughput (tokens generated per unit of time), of 20 LLMs using two inference libraries: vLLM and HuggingFace's pipelines. We investigate how various hyperparameters, which developers must configure, influence inference performance. Our results reveal that throughput landscapes are irregular, with distinct peaks, highlighting the importance of hyperparameter optimization to achieve maximum performance. We also show that applying hyperparameter optimization when upgrading or downgrading the GPU model used for inference can improve throughput from HuggingFace pipelines by an average of 9.16% and 13.7%, respectively.

What problem does this paper attempt to address?

The paper primarily explores the impact of hyperparameters on the inference performance of large language models (LLMs), with a particular focus on throughput (the number of tokens generated per unit of time). The authors analyze the performance of two popular inference libraries, vLLM and HuggingFace Pipelines, under different hyperparameter settings. The main research points in the paper include:

1. **Impact of Hyperparameters on Inference Performance**: The authors evaluated multiple large language models and observed how adjusting different hyperparameters (such as the number of GPUs used and the batch size) affects inference performance. The results show that the throughput landscape is irregular and has noticeable peaks, indicating that hyperparameter optimization is necessary to achieve optimal performance.
2. **Impact of GPU Quantity on Online Inference**: Online inference refers to scenarios where each query contains only one input. The paper studied how different numbers of GPUs affect throughput and found that increasing the number of GPUs can improve throughput, but the growth is not linear, and there is an optimal configuration.
3. **Impact of Batch Size on Batch Inference**: Unlike online inference, batch inference allows processing multiple inputs at once. The paper examined how changes in batch size affect throughput and found that the choice of batch size is crucial: too large a batch size may exhaust GPU memory, while too small a batch size may leave resources underused.
4. **Comparison of Inference Performance Across Different GPU Models**: The paper also compared inference performance across different GPU models (e.g., Nvidia A100 vs. Nvidia V100) and pointed out that newer hardware generally delivers better performance.
5. **Effectiveness of Hyperparameter Optimization**: Finally, the paper demonstrated that applying hyperparameter optimization tools (such as the Hyperopt-based InfPop) during hardware upgrades or downgrades can improve the throughput of HuggingFace Pipelines, with average improvements of 9.16% (GPU upgrade scenario) and 13.7% (GPU downgrade scenario).

In summary, the paper focuses on how to maximize the inference performance of large language models through well-chosen hyperparameter settings, and it provides empirical analysis to support its conclusions.
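To make the idea of searching a throughput landscape concrete, here is a minimal Python sketch. It is not the paper's tool or data: the summary mentions a Hyperopt-based optimizer, whereas this example swaps in a plain exhaustive grid search for simplicity. The function `simulated_run` and its cost model (a fixed startup cost, a per-GPU communication overhead, and a simulated memory limit of 32 inputs per GPU) are invented purely for illustration; in practice it would be replaced by a real timed inference run.

```python
import itertools
from typing import Dict, Iterable, Optional, Tuple


def throughput(tokens_generated: int, elapsed_seconds: float) -> float:
    """Throughput as defined in the text: tokens generated per unit of time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return tokens_generated / elapsed_seconds


def simulated_run(batch_size: int, num_gpus: int) -> Tuple[int, float]:
    """Hypothetical stand-in for a real inference run (NOT measured data).

    Returns (tokens_generated, elapsed_seconds) from an invented cost model,
    and raises MemoryError when the batch exceeds a made-up memory limit.
    """
    if batch_size > 32 * num_gpus:
        raise MemoryError("simulated out-of-memory")
    tokens = batch_size * 100  # assume 100 generated tokens per input
    elapsed = (
        1.0                              # fixed startup cost
        + 0.05 * batch_size / num_gpus   # compute time, split across GPUs
        + 0.5 * (num_gpus - 1)           # assumed inter-GPU communication cost
    )
    return tokens, elapsed


def grid_search(
    batch_sizes: Iterable[int], gpu_counts: Iterable[int]
) -> Tuple[Optional[Dict[str, int]], float]:
    """Try every (batch_size, num_gpus) pair and keep the highest throughput."""
    best_config: Optional[Dict[str, int]] = None
    best_tp = 0.0
    for b, g in itertools.product(batch_sizes, gpu_counts):
        try:
            tokens, elapsed = simulated_run(b, g)
        except MemoryError:
            continue  # configuration does not fit in (simulated) memory
        tp = throughput(tokens, elapsed)
        if tp > best_tp:
            best_config, best_tp = {"batch_size": b, "num_gpus": g}, tp
    return best_config, best_tp


if __name__ == "__main__":
    config, tp = grid_search([1, 2, 4, 8, 16, 32, 64], [1, 2, 4])
    print(config, round(tp, 1))
```

Under this toy cost model, the best configuration on the grid is a batch size of 64 on 2 GPUs rather than 4, which mirrors the summary's point that adding GPUs does not improve throughput linearly. A real study would replace the exhaustive loop with a sample-efficient optimizer, since each measurement requires an actual inference run.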

The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines

From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference

LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators

Inference Performance Optimization for Large Language Models on CPUs

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Efficient LLM inference solution on Intel GPU

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Deploying Open-Source Large Language Models: A performance Analysis

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

Inference Acceleration for Large Language Models on CPUs

Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective

Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models

A Hardware Evaluation Framework for Large Language Model Inference

Fast Distributed Inference Serving for Large Language Models

LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services

Understanding LLMs: A Comprehensive Overview from Training to Inference

Self-Selected Attention Span for Accelerating Large Language Model Inference

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Using Large Language Models for Hyperparameter Optimization