Abstract: The increasing demand for Large Language Models (LLMs) across various applications has led to a significant shift in the design of deep learning serving systems. Deploying LLMs, particularly in multi-tenant environments, poses substantial challenges due to their high computational and memory demands. We introduce BlockLLM, a serving system that leverages component sharing among fine-tuned LLM models to provide an efficient and flexible solution for LLM workloads. BlockLLM partitions models into finer-grained blocks, enabling the reuse of model components and independent provisioning to improve computation efficiency. BlockLLM comprises an offline block zoo for storing blocks and an online system to serve requests through chains of blocks. It offers flexibility in three ways: (1) adaptive assembly of blocks on the fly through equivalence evaluation among blocks in the zoo; (2) per-block batch size configuration and best-effort KV cache coordination at the individual block level; (3) speculative execution and locality-aware block placement to reduce communication costs from dynamic block resource allocation. Our evaluation shows that BlockLLM reduces memory and storage footprints and improves computational efficiency, outperforming existing serving approaches in 95th-percentile latency and GPU utilization by 33.5% and 20.1%, respectively, with minimal impact on accuracy.

What problem does this paper attempt to address?

The paper attempts to address the high computational and memory demands faced when deploying large language models (LLMs) in multi-tenant environments. Specifically, the authors propose a serving system called BlockLLM, which aims to improve the efficiency and flexibility of LLM serving through component sharing and fine-grained partitioning. Below are the main issues and solutions presented in the paper:

### Issues

1. **High computational and memory demands**: Deploying LLMs in multi-tenant environments requires a large amount of computational resources and memory, leading to inefficient resource utilization and high costs.
2. **Resource allocation in multi-tenant environments**: How to efficiently serve fine-tuned models for multiple tenants in a shared cluster while meeting the requirements of low latency and high throughput.
3. **Limitations of existing serving systems**: Existing LLM serving systems typically deploy each model as a whole, failing to fully utilize the shared components within the models, resulting in resource wastage.

### Solutions

1. **Fine-grained partitioning and component sharing**: BlockLLM divides LLMs into smaller, shareable components (referred to as "blocks") and achieves flexible serving through the dynamic combination of these blocks.
2. **Block Zoo**: BlockLLM maintains an offline Block Zoo that stores various LLM blocks and evaluates the equivalence relationships between blocks to support adaptive online serving.
3. **Dynamic resource allocation**: Each block can be independently configured and scaled, with resources dynamically allocated based on actual demand, improving resource utilization.
4. **KV cache management**: BlockLLM ensures KV cache consistency between different requests through a best-effort KV cache coordination strategy, reducing communication overhead.
5. **Speculative execution and locality-aware placement**: By predicting the output of bottleneck blocks and employing a locality-aware block placement strategy, BlockLLM reduces cross-server communication latency.

### Main Contributions

1. **Reduced memory and storage demands**: Through the sharing and reuse of blocks, BlockLLM reduces memory and storage demands, improving the utilization efficiency of computational resources.
2. **Building the Block Zoo**: Storing and managing fine-grained LLM blocks and establishing equivalence relationships between blocks to support adaptive online serving.
3. **Increased cluster throughput**: Through best-effort request scheduling and KV cache management, as well as speculative execution and locality-aware placement, BlockLLM increases overall cluster throughput.
4. **Performance evaluation**: BlockLLM was implemented on a cluster of 12 A100 GPUs, demonstrating significant advantages in reducing 95th-percentile latency and improving GPU utilization.

Through these innovations, BlockLLM effectively addresses the challenges of LLM deployment and serving in multi-tenant environments, enhancing resource utilization efficiency and system performance.
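To make the block-sharing idea concrete, the following is a minimal sketch of how a block store might keep identical blocks only once and describe each fine-tuned model as a chain of block identifiers. It is not BlockLLM's implementation: the `BlockZoo` class, its methods, and the toy byte-string "weights" are all hypothetical, and exact content hashing is a simplifying stand-in for the equivalence evaluation the paper describes, which is not detailed here.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class BlockZoo:
    """Toy block store (hypothetical): identical blocks are stored once."""
    blocks: dict = field(default_factory=dict)  # block id -> raw weight bytes
    chains: dict = field(default_factory=dict)  # model name -> ordered block ids

    def register(self, model_name, layers):
        """Split a model into per-layer blocks and record its chain."""
        chain = []
        for weights in layers:
            # Assumption: byte-identical weights count as the same block.
            block_id = hashlib.sha256(weights).hexdigest()[:12]
            self.blocks.setdefault(block_id, weights)  # store only if new
            chain.append(block_id)
        self.chains[model_name] = chain
        return chain

    def stored_bytes(self):
        """Bytes actually held after deduplication."""
        return sum(len(w) for w in self.blocks.values())


def naive_bytes(models):
    """Bytes needed if every model is stored as a whole."""
    return sum(len(w) for layers in models.values() for w in layers)


# Toy data: two tenants fine-tune only the last of four layers.
base = [bytes([i]) * 1024 for i in range(4)]
models = {
    "base": base,
    "tenant_a": base[:3] + [b"\xaa" * 1024],
    "tenant_b": base[:3] + [b"\xbb" * 1024],
}

zoo = BlockZoo()
for name, layers in models.items():
    zoo.register(name, layers)

print("whole-model storage:", naive_bytes(models), "bytes")
print("block-level storage:", zoo.stored_bytes(), "bytes")
print("tenant_a chain:", zoo.chains["tenant_a"])
```

In this toy setup the three models share their first three blocks, so the store holds six distinct blocks instead of twelve. A real system would need an approximate notion of equivalence, since fine-tuned weights are rarely byte-identical.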

BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
