LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services

Małgorzata Łazuka, Andreea Anghel, Thomas Parnell

2024-10-03

Abstract: As Large Language Models (LLMs) are rapidly growing in popularity, LLM inference services must be able to serve requests from thousands of users while satisfying performance requirements. The performance of an LLM inference service is largely determined by the hardware onto which it is deployed, but understanding of which hardware will deliver on performance requirements remains challenging. In this work we present LLM-Pilot - a first-of-its-kind system for characterizing and predicting performance of LLM inference services. LLM-Pilot performs benchmarking of LLM inference services, under a realistic workload, across a variety of GPUs, and optimizes the service configuration for each considered GPU to maximize performance. Finally, using this characterization data, LLM-Pilot learns a predictive model, which can be used to recommend the most cost-effective hardware for a previously unseen LLM. Compared to existing methods, LLM-Pilot can deliver on performance requirements 33% more frequently, whilst reducing costs by 60% on average.

Distributed, Parallel, and Cluster Computing, Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper addresses the problem of selecting hardware for large language model (LLM) inference services so that performance requirements are met at the lowest cost. As LLMs grow in popularity, inference services must handle requests from thousands of users while satisfying performance requirements. Performance is largely determined by the hardware onto which the service is deployed, yet it is hard to know in advance which hardware will deliver on those requirements. To tackle this challenge, the paper proposes a system called LLM-Pilot, which does the following:

1. **Benchmarking**: Benchmarks LLM inference services across a variety of GPUs under a realistic workload.
2. **Optimized Configuration**: Optimizes the service configuration for each considered GPU to maximize performance.
3. **Prediction Model**: Learns a predictive model from the characterization data and uses it to recommend the most cost-effective hardware for a previously unseen LLM.

Compared to existing methods, LLM-Pilot delivers on performance requirements 33% more frequently while reducing costs by 60% on average.
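The final recommendation step can be illustrated with a minimal Python sketch. It is not LLM-Pilot's actual implementation: it assumes a predictive model has already produced a latency estimate for each candidate GPU, and it simply picks the cheapest candidate that satisfies a latency limit. The `Candidate` class, the `recommend` function, the GPU names, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One hardware option with a predicted performance figure (all hypothetical)."""
    gpu: str                      # placeholder GPU profile name, not a real product
    hourly_cost: float            # assumed price per hour, arbitrary currency units
    predicted_latency_ms: float   # assumed output of some predictive model


def recommend(candidates: Sequence[Candidate],
              max_latency_ms: float) -> Optional[Candidate]:
    """Return the cheapest candidate whose predicted latency meets the limit.

    Returns None when no candidate is predicted to satisfy the requirement.
    """
    feasible = [c for c in candidates if c.predicted_latency_ms <= max_latency_ms]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.hourly_cost)


# Illustrative, made-up predictions for a previously unseen LLM.
candidates = [
    Candidate(gpu="gpu-small", hourly_cost=1.0, predicted_latency_ms=120.0),
    Candidate(gpu="gpu-medium", hourly_cost=2.5, predicted_latency_ms=60.0),
    Candidate(gpu="gpu-large", hourly_cost=6.0, predicted_latency_ms=25.0),
]

choice = recommend(candidates, max_latency_ms=80.0)
print(choice.gpu if choice else "no feasible hardware")
```

The sketch separates prediction from selection: the quality of the recommendation depends entirely on how accurate the predicted latencies are, which is the part the paper's predictive model is meant to supply.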

LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators

UELLM: A Unified and Efficient Approach for LLM Inference Serving

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference

Plug-and-Play Performance Estimation for LLM Services without Relying on Labeled Data

Fast distributed inference serving for large language models

Inference Performance Optimization for Large Language Models on CPUs

A Hardware Evaluation Framework for Large Language Model Inference

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization

Large Language Models on Small Resource-Constrained Systems: Performance Characterization, Analysis and Trade-offs

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Efficient LLM inference solution on Intel GPU

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline