BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models

Bodun Hu, Jiamin Li, Le Xu, Myungjin Lee, Akshay Jajoo, Geon-Woo Kim, Hong Xu, Aditya Akella
2024-09-24
Abstract: The increasing demand for Large Language Models (LLMs) across various applications has led to a significant shift in the design of deep learning serving systems. Deploying LLMs, particularly in multi-tenant environments, poses substantial challenges due to their high computational and memory demands. We introduce BlockLLM, a serving system that leverages component sharing among fine-tuned LLM models to provide an efficient and flexible solution for LLM workloads. BlockLLM partitions models into finer-grained blocks, enabling the reuse of model components and independent provisioning to improve computation efficiency. BlockLLM comprises an offline block zoo for storing blocks and an online system to serve requests through chains of blocks. It offers multi-fold flexibilities: (1) Adaptive assembly of blocks on-the-fly through equivalence evaluation among blocks in the zoo; (2) Per-block batch size configuration and best-effort KV cache coordination at the individual block level; (3) Speculative execution and locality-aware block placement to reduce communication costs from dynamic block resource allocation. Our evaluation shows that BlockLLM reduces memory and storage footprints and improves computational efficiency, outperforming existing serving approaches in 95%ile latency and GPU utilization by 33.5% and 20.1%, respectively, with minimal impact on accuracy.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses the high computational and memory demands of deploying large language models (LLMs) in multi-tenant environments. The authors propose a serving system called BlockLLM, which improves the efficiency and flexibility of LLM serving through component sharing and fine-grained partitioning. The main issues and solutions presented in the paper are summarized below.

### Issues

1. **High computational and memory demands**: Deploying LLMs in multi-tenant environments requires large amounts of compute and memory, leading to inefficient resource utilization and high costs.
2. **Resource allocation in multi-tenant environments**: Fine-tuned models for multiple tenants must be served efficiently in a shared cluster while meeting low-latency and high-throughput requirements.
3. **Limitations of existing serving systems**: Existing LLM serving systems typically deploy each model as a whole and do not exploit the components that fine-tuned models have in common, which wastes resources.

### Solutions

1. **Fine-grained partitioning and component sharing**: BlockLLM divides LLMs into smaller, shareable components (referred to as "blocks") and serves requests by dynamically combining these blocks into chains.
2. **Block Zoo**: BlockLLM maintains an offline Block Zoo that stores LLM blocks and evaluates equivalence among them, so that blocks can be assembled adaptively at serving time.
3. **Dynamic resource allocation**: Each block can be configured and scaled independently, including its batch size, with resources allocated according to actual demand.
4. **KV cache management**: BlockLLM coordinates KV caches at the individual block level on a best-effort basis, reducing communication overhead.
5. **Speculative execution and locality-aware placement**: By predicting the output of bottleneck blocks and placing blocks with locality in mind, BlockLLM reduces cross-server communication latency.

### Main Contributions

1. **Reduced memory and storage demands**: Sharing and reusing blocks reduces memory and storage footprints and improves the utilization of computational resources.
2. **Building the Block Zoo**: The Block Zoo stores and manages fine-grained LLM blocks and establishes equivalence relationships between them to support adaptive online serving.
3. **Increased cluster throughput**: Best-effort request scheduling and KV cache management, together with speculative execution and locality-aware placement, increase overall cluster throughput.
4. **Performance evaluation**: BlockLLM was implemented on a cluster of 12 A100 GPUs; the abstract reports improvements of 33.5% in 95%ile latency and 20.1% in GPU utilization over existing serving approaches.

Through these techniques, BlockLLM addresses the challenges of deploying and serving LLMs in multi-tenant environments, improving resource utilization and system performance.
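
To make the storage argument behind block sharing concrete, here is a minimal Python sketch. It is not BlockLLM's implementation: the `Block` and `BlockZoo` classes, the block names, and the sizes in GB are all hypothetical and chosen only for illustration. The sketch also treats two blocks as shareable only when they have the same ID, whereas the paper's zoo evaluates equivalence among blocks, which is not modeled here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Block:
    """Hypothetical model component, e.g. a group of transformer layers."""
    block_id: str
    size_gb: float  # illustrative size, not a figure from the paper


@dataclass
class BlockZoo:
    """Toy stand-in for an offline block zoo: each distinct block is stored once."""
    blocks: dict = field(default_factory=dict)  # block_id -> Block
    chains: dict = field(default_factory=dict)  # model name -> ordered block ids

    def register_model(self, model_name, blocks):
        # Store a block only if it is not already in the zoo, then record the
        # model as a chain of block ids.
        for block in blocks:
            self.blocks.setdefault(block.block_id, block)
        self.chains[model_name] = [block.block_id for block in blocks]

    def shared_footprint_gb(self):
        # Footprint when every distinct block is kept exactly once.
        return sum(block.size_gb for block in self.blocks.values())

    def monolithic_footprint_gb(self):
        # Footprint when every model is stored as a whole, duplicating shared blocks.
        return sum(
            self.blocks[block_id].size_gb
            for chain in self.chains.values()
            for block_id in chain
        )


# Assumed scenario: two tenants fine-tune only the upper layers of a base model.
embed = Block("embed", 1.0)
lower = Block("layers_0_15", 6.0)
upper_base = Block("layers_16_31", 6.0)
upper_a = Block("layers_16_31_tenant_a", 6.0)
upper_b = Block("layers_16_31_tenant_b", 6.0)

zoo = BlockZoo()
zoo.register_model("base", [embed, lower, upper_base])
zoo.register_model("tenant_a", [embed, lower, upper_a])
zoo.register_model("tenant_b", [embed, lower, upper_b])

print(zoo.monolithic_footprint_gb())  # 39.0
print(zoo.shared_footprint_gb())      # 25.0
```

In this toy scenario the three models occupy 39 GB when stored whole and 25 GB when the embedding and lower-layer blocks are stored once and reused. The actual savings in BlockLLM depend on how much the served models really have in common and on the outcome of its equivalence evaluation.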