Abstract: Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language understanding, information retrieval and search, translation, chatbots, virtual assistance, and many more. However, it is well known that LLMs are massive in terms of the number of parameters. Additionally, the self-attention mechanism in the underlying architecture of LLMs, the Transformer, has quadratic complexity in both computation and memory with respect to the input sequence length. For these reasons, LLM inference is resource-intensive, and thus the throughput of LLM inference is limited, especially for longer sequences. In this report, we design a collaborative inference architecture between a server and its clients to alleviate the throughput limit. In this design, we consider the available resources on both sides, i.e., the computation and communication costs. We develop a dynamic programming-based algorithm to optimally allocate computation between the server and the client device to increase the server throughput without violating the service level agreement (SLA). We show in the experiments that we are able to efficiently distribute the workload, allowing for a roughly one-third reduction in the server workload while achieving a 19 percent improvement over a greedy method. As a result, we are able to demonstrate that, in an environment with different types of LLM inference requests, the throughput of the server is improved.

What problem does this paper attempt to address?

This paper addresses the heavy resource consumption and limited throughput of large language models (LLMs) during inference. Because LLMs have a very large number of parameters and the self-attention mechanism has quadratic complexity, computation and memory costs grow rapidly for long input sequences, which limits server throughput, especially in edge-computing environments. To address this, the paper proposes a collaborative inference architecture that allocates computation between the server and the client in order to reduce the server's workload and improve overall throughput.

### Main contributions of the paper

1. **Collaborative inference architecture**: An inference architecture for collaboration between the server and the client is designed. It takes into account the computation and communication costs of both sides and allocates computation so as to improve the server's throughput without violating the service-level agreement (SLA).
2. **Dynamic programming algorithm**: A dynamic programming-based algorithm is developed to optimally allocate computation between the server and the client, thereby reducing the server's workload and increasing throughput.
3. **Experimental verification**: The method is evaluated experimentally. The results show a roughly one-third reduction in the server workload and a 19 percent improvement over a greedy method.

### Formula explanation

- **Delay constraint**:

\[ \Lambda(m) \geq \sum_{l \in L(m)} x_l \left( c^{(e)}_l + (1 - x_{l-1})\, d_l \right) + \sum_{l \in L(m)} (1 - x_l) \left( c^{(s)}_l + x_{l-1}\, u_l \right) \]

  where:
  - \( \Lambda(m) \) is the maximum allowable delay of model \( m \), and \( L(m) \) is its set of layers.
  - \( x_l \) is a binary variable indicating where the \( l \)-th layer is executed (1 means on the client, 0 means on the server).
  - \( c^{(e)}_l \) and \( c^{(s)}_l \) are the times required to execute the \( l \)-th layer on the client and on the server, respectively.
  - \( d_l \) and \( u_l \) are the times required to download the input data of the \( l \)-th layer from the server and to upload it to the server, respectively.
  - \( x_{l-1} \) indicates the execution location of the \( (l-1) \)-th layer. The download time \( d_l \) is therefore incurred only when layer \( l \) runs on the client and layer \( l-1 \) ran on the server, and the upload time \( u_l \) only when layer \( l \) runs on the server and layer \( l-1 \) ran on the client.

- **Optimization objective**:

\[ \min_{x_l,\ \forall l \in L(m)} \sum_{l \in L(m)} (1 - x_l)\, r_l \]

  where:
  - \( r_l \) is the computational load of the \( l \)-th layer, which can be measured in FLOPs (floating-point operations) or another computational metric.

### Conclusion

By allocating computation between the server and the client, the collaborative inference architecture proposed in this paper reduces the server's workload and improves throughput without violating the delay requirement. This makes the method relevant to deploying large language models in edge-computing environments.
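The delay constraint and objective from the formula explanation can be illustrated with a small dynamic program. The sketch below is not the paper's algorithm; it is one possible way to solve the stated problem under several assumptions: the function name `split_layers` and its parameters are hypothetical, all times are nonnegative, the loads \( r_l \) are integers (or coarsely quantized) so the table stays small, and the location of the input before the first layer, `x0`, is given (1 means the input starts on the client). For every layer, location, and accumulated server load, it keeps the placement with the smallest delay, then returns the feasible placement with the smallest server load.

```python
def split_layers(c_e, c_s, d, u, r, max_delay, x0=1):
    """Minimize server load sum((1 - x_l) * r_l) subject to the delay bound.

    Returns (server_load, delay, placement) or None if no placement is feasible.
    placement[l] == 1 runs layer l on the client, 0 on the server.
    Hypothetical helper: assumes nonnegative times and integer loads.
    """
    n = len(r)
    # best[x][load] = (min delay, placement) for prefixes whose last layer ran at x
    best = {x0: {0: (0, [])}}
    for l in range(n):
        nxt = {0: {}, 1: {}}
        for prev, table in best.items():
            for load, (delay, plan) in table.items():
                for x in (0, 1):
                    if x == 1:
                        # client execution, plus download if the previous layer was on the server
                        step, new_load = c_e[l] + (1 - prev) * d[l], load
                    else:
                        # server execution, plus upload if the previous layer was on the client
                        step, new_load = c_s[l] + prev * u[l], load + r[l]
                    new_delay = delay + step
                    if new_delay > max_delay:
                        continue  # times are nonnegative, so this prefix can never recover
                    cur = nxt[x].get(new_load)
                    if cur is None or new_delay < cur[0]:
                        nxt[x][new_load] = (new_delay, plan + [x])
        best = nxt
    candidates = [
        (load, delay, plan)
        for table in best.values()
        for load, (delay, plan) in table.items()
    ]
    if not candidates:
        return None
    # smallest server load first, ties broken by smaller delay
    return min(candidates, key=lambda t: (t[0], t[1]))


if __name__ == "__main__":
    # Toy instance with three layers; all numbers are made up for illustration.
    result = split_layers(
        c_e=[4, 4, 4], c_s=[1, 1, 1],
        d=[1, 1, 1], u=[1, 1, 1],
        r=[2, 3, 5], max_delay=8,
    )
    print(result)  # (5, 8, [0, 0, 1])
```

In this toy instance the client is four times slower than the server, so only one layer can be moved to the client within the delay bound of 8. The sketch moves the last layer, which has the largest load, leaving a server load of 5 instead of 10.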

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

Efficient and Economic Large Language Model Inference with Attention Offloading

Aladdin: Joint Placement and Scaling for SLO-Aware LLM Serving

CoLLM: A Collaborative LLM Inference Framework for Resource-Constrained Devices

Efficient LLM Scheduling by Learning to Rank

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads

UELLM: A Unified and Efficient Approach for LLM Inference Serving

AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction

New Solutions on LLM Acceleration, Optimization, and Application

Splitwise: Efficient generative LLM inference using phase splitting

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

Distributed Inference Performance Optimization for LLMs on CPUs

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

OptLLM: Optimal Assignment of Queries to Large Language Models

DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

A System for Microserving of LLMs

Towards Pareto Optimal Throughput in Small Language Model Serving

Llumnix: Dynamic Scheduling for Large Language Model Serving