The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines

Matias Martinez
2024-08-02
Abstract: The recent surge of open-source large language models (LLMs) enables developers to create AI-based solutions while maintaining control over aspects such as privacy and compliance, thereby providing governance and ownership of the model deployment process. To utilize these LLMs, inference engines are needed. These engines load the model's weights onto available resources, such as GPUs, and process queries to generate responses. The speed of inference, or performance, of the LLM is critical for real-time applications, as each inference involves millions or billions of floating point operations. Recently, advanced inference engines such as vLLM have emerged, incorporating novel mechanisms such as efficient memory management to achieve state-of-the-art performance. In this paper, we analyze the performance, particularly the throughput (tokens generated per unit of time), of 20 LLMs using two inference libraries: vLLM and HuggingFace's pipelines. We investigate how various hyperparameters, which developers must configure, influence inference performance. Our results reveal that throughput landscapes are irregular, with distinct peaks, highlighting the importance of hyperparameter optimization to achieve maximum performance. We also show that applying hyperparameter optimization when upgrading or downgrading the GPU model used for inference can improve throughput from HuggingFace pipelines by an average of 9.16% and 13.7%, respectively.
Software Engineering, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper primarily explores the impact of hyperparameters on the inference performance of large language models (LLMs), with a particular focus on throughput (the number of tokens generated per unit of time). The authors analyze the performance of two popular inference libraries, vLLM and HuggingFace Pipelines, under different hyperparameter settings. The main research points in the paper include:

1. **Impact of Hyperparameters on Inference Performance**: The authors evaluated multiple large language models and observed how adjusting different hyperparameters (such as the number of GPUs used and the batch size) affects inference performance. The results show that the throughput landscape is irregular and has noticeable peaks, indicating that hyperparameter optimization is necessary to achieve optimal performance.
2. **Impact of GPU Quantity on Online Inference**: Online inference refers to scenarios where each query contains only one input. The paper studied how different numbers of GPUs affect throughput and found that increasing the number of GPUs can improve throughput, but the growth is not linear, and there is an optimal configuration.
3. **Impact of Batch Size on Batch Inference**: Unlike online inference, batch inference allows processing multiple inputs at once. The paper examined how changes in batch size affect throughput and found that the choice of batch size is crucial: too large a batch size may cause out-of-memory errors, while too small a batch size may leave resources underused.
4. **Comparison of Inference Performance Across Different GPU Models**: The paper also compared inference performance across different GPU models (e.g., Nvidia A100 vs. Nvidia V100) and pointed out that newer hardware models generally deliver better performance.
5. **Effectiveness of Hyperparameter Optimization**: Finally, the paper demonstrated that applying hyperparameter optimization tools (such as the Hyperopt-based InfPop) during hardware upgrades or downgrades can significantly improve the throughput of HuggingFace Pipelines, with average improvements of 9.16% (GPU upgrade scenario) and 13.7% (GPU downgrade scenario).

In summary, the paper focuses on how to maximize the inference performance of large language models through well-chosen hyperparameter settings, and it provides empirical analysis to support its conclusions.
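
To make the idea of searching a throughput landscape concrete, the following is a minimal Python sketch, not the paper's tooling. It assumes two hyperparameters (number of GPUs and batch size), and `toy_benchmark` is a hypothetical stand-in whose formula is invented purely for illustration; it is not data from the paper. A plain exhaustive grid search stands in for the Hyperopt-based optimizer mentioned above, and all function names here are assumptions.

```python
import itertools
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Config = Dict[str, int]


def throughput(tokens_generated: int, elapsed_seconds: float) -> float:
    """Throughput as defined in the abstract: tokens generated per unit of time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return tokens_generated / elapsed_seconds


def toy_benchmark(num_gpus: int, batch_size: int) -> float:
    """HYPOTHETICAL stand-in for a real benchmark run; returns tokens per second.

    The formula is invented for illustration only. A real version would run
    the inference engine and call throughput() on the measured values.
    """
    if batch_size > 64 * num_gpus:  # assumed memory limit, not a measured one
        raise MemoryError("batch does not fit in GPU memory")
    # Assumed shape: throughput rises with batch size, then falls off.
    saturation = 1 + (batch_size / (16 * num_gpus)) ** 2
    return (num_gpus ** 0.5) * batch_size / saturation


def make_grid(gpu_counts: Iterable[int], batch_sizes: Iterable[int]) -> List[Config]:
    """Build every combination of the candidate hyperparameter values."""
    return [
        {"num_gpus": g, "batch_size": b}
        for g, b in itertools.product(gpu_counts, batch_sizes)
    ]


def grid_search(
    benchmark: Callable[..., float], grid: Iterable[Config]
) -> Tuple[Optional[Config], float]:
    """Return the configuration with the highest throughput, skipping OOM runs."""
    best_config: Optional[Config] = None
    best_tps = float("-inf")
    for config in grid:
        try:
            tps = benchmark(**config)
        except MemoryError:
            continue  # configuration is infeasible on this hardware
        if tps > best_tps:
            best_config, best_tps = config, tps
    return best_config, best_tps


def relative_improvement(baseline_tps: float, tuned_tps: float) -> float:
    """Percentage gain of the tuned throughput over the baseline."""
    return (tuned_tps - baseline_tps) / baseline_tps * 100.0


best_config, best_tps = grid_search(
    toy_benchmark, make_grid([1, 2, 4], [8, 16, 32, 64, 128])
)
print(best_config, best_tps)
```

Exhaustive search is only practical for small grids like this one; with more hyperparameters or expensive benchmark runs, a sequential optimizer such as Hyperopt would typically evaluate far fewer configurations. Re-running the search after a hardware change is what the upgrade and downgrade scenarios above amount to, with `relative_improvement` comparing the re-tuned throughput against the throughput of the previously chosen configuration.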