LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services

Małgorzata Łazuka, Andreea Anghel, Thomas Parnell
2024-10-03
Abstract: As Large Language Models (LLMs) are rapidly growing in popularity, LLM inference services must be able to serve requests from thousands of users while satisfying performance requirements. The performance of an LLM inference service is largely determined by the hardware onto which it is deployed, but understanding of which hardware will deliver on performance requirements remains challenging. In this work we present LLM-Pilot - a first-of-its-kind system for characterizing and predicting performance of LLM inference services. LLM-Pilot performs benchmarking of LLM inference services, under a realistic workload, across a variety of GPUs, and optimizes the service configuration for each considered GPU to maximize performance. Finally, using this characterization data, LLM-Pilot learns a predictive model, which can be used to recommend the most cost-effective hardware for a previously unseen LLM. Compared to existing methods, LLM-Pilot can deliver on performance requirements 33% more frequently, whilst reducing costs by 60% on average.
Distributed, Parallel, and Cluster Computing; Computation and Language; Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of selecting hardware for large language model (LLM) inference services so that performance requirements are met at the lowest cost. As LLMs grow in popularity, inference services must serve requests from thousands of users while satisfying performance requirements. The hardware a service is deployed on largely determines its performance, yet it is hard to know in advance which hardware will deliver on those requirements. To tackle this challenge, the paper proposes a system called LLM-Pilot, which does the following:

1. **Benchmarking**: Benchmarks LLM inference services under a realistic workload across a variety of GPUs.
2. **Optimized Configuration**: Optimizes the service configuration for each considered GPU to maximize performance.
3. **Prediction Model**: Learns a predictive model from this characterization data and uses it to recommend the most cost-effective hardware for a previously unseen LLM.

According to the abstract, LLM-Pilot delivers on performance requirements 33% more frequently than existing methods while reducing costs by 60% on average.
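
The recommendation step described above can be illustrated with a minimal sketch: given predicted performance for each candidate GPU, pick the cheapest one that satisfies the requirement. This is not the paper's implementation. The `GpuOption` record, the `recommend_gpu` helper, the single latency metric, and all names and numbers are hypothetical, chosen only to show the selection logic. The predictive model that would produce the latency estimates is assumed to exist and is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuOption:
    """Hypothetical record: one candidate GPU setup with a predicted latency."""
    name: str                    # illustrative label, not a real benchmark entry
    cost_per_hour: float         # assumed hourly price, arbitrary currency units
    predicted_latency_ms: float  # assumed output of some performance predictor


def recommend_gpu(options: list[GpuOption], max_latency_ms: float) -> Optional[GpuOption]:
    """Return the cheapest option predicted to meet the latency requirement."""
    # Keep only the options whose predicted latency satisfies the requirement.
    feasible = [o for o in options if o.predicted_latency_ms <= max_latency_ms]
    if not feasible:
        return None  # no candidate is predicted to meet the requirement
    # Among the feasible options, choose the one with the lowest hourly cost.
    return min(feasible, key=lambda o: o.cost_per_hour)


# Made-up candidates for illustration only.
candidates = [
    GpuOption("gpu-small", 1.0, 180.0),
    GpuOption("gpu-medium", 2.5, 90.0),
    GpuOption("gpu-large", 6.0, 40.0),
]

choice = recommend_gpu(candidates, max_latency_ms=100.0)
print(choice.name if choice else "no feasible option")  # prints "gpu-medium"
```

Filtering on the requirement first and minimizing cost second reflects the goal stated in the abstract: the recommended hardware must meet the performance requirements, and cost only decides among options that do.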