Abstract: In a multi-tenant large language model (LLM) serving platform hosting diverse applications, some users may submit an excessive number of requests, causing the service to become unavailable to other users and creating unfairness. Existing fairness approaches do not account for variations in token lengths across applications and multiple LLM calls, making them unsuitable for such platforms. To address the fairness challenge, this paper analyzes millions of requests from thousands of users on MS CoPilot, a real-world multi-tenant LLM platform hosted by Microsoft. Our analysis confirms the inadequacy of existing methods and guides the development of FairServe, a system that ensures fair LLM access across diverse applications. FairServe proposes application-characteristic aware request throttling coupled with a weighted service counter based scheduling technique to curb abusive behavior and ensure fairness. Our experimental results on real-world traces demonstrate FairServe's superior performance compared to the state-of-the-art method in ensuring fairness. We are actively working on deploying our system in production, expecting to benefit millions of customers world-wide.

### What problem does this paper attempt to solve?

This paper addresses unfair resource allocation on multi-tenant large language model (LLM) serving platforms. When multiple applications share the same LLM platform, some users may submit so many requests that the service becomes unavailable to other users, which is unfair. Existing fairness methods ignore both the differences in token lengths between applications and the effect of applications that make multiple LLM calls, so they are not suitable for such platforms.

#### Summary of main problems

1. **Abusive behavior**: Some users abuse system resources by submitting a large number of requests, leading to service interruptions and unfair treatment of other users.
2. **Deficiencies of existing methods**:
   - **Requests-per-minute (RPM) rate limiting**: It can prevent abuse, but in a multi-agent scenario, throttling a user partway through a sequence of LLM calls wastes the resources already spent on the earlier calls.
   - **Virtual Token Counter (VTC)**: It cannot completely prevent abusive behavior, and because it allocates resources equally, it may lead to resource waste, long request queues, increased latency, and a worse user experience.

#### Solution

To solve these problems, the paper proposes a system named **FairServe**, which contains two core components:

- **Overload and Interaction-driven Throttling (OIT)**: Throttles requests only when the system is overloaded, and uses application characteristics to avoid wasting tokens.
- **Weighted Service Counter (WSC)**: Selects requests according to weighted resource slices to ensure fairness, giving priority to the users who have received the least service.

#### Experimental results

The experimental results show that FairServe outperforms existing methods in ensuring fairness and preventing abuse. It reduces waiting-queue latency by 10.67× to 93×, reduces latency by 1.03× to 1.06×, increases throughput by 1.03× to 1.75×, and achieves 0% token waste in all applications. With these improvements, FairServe can serve users across diverse applications while allocating resources fairly and preventing abusive behavior.
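
The two ideas in the solution above, throttling only under overload and serving the least-served user first, can be illustrated with a small Python sketch. This is not the paper's implementation: the class name `WeightedCounterScheduler`, the token weights, the overload threshold, and the per-user cap are all hypothetical values chosen for illustration, and the sketch does not model the interaction-aware part of OIT (deciding where in a multi-call interaction throttling is allowed).

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    user: str
    input_tokens: int
    output_tokens: int


@dataclass
class WeightedCounterScheduler:
    # Hypothetical weights: output tokens are assumed costlier than input tokens.
    input_weight: float = 1.0
    output_weight: float = 2.0
    # Hypothetical overload threshold on the total number of queued requests.
    overload_queue_len: int = 4
    # Hypothetical per-user queue cap, enforced only while overloaded.
    per_user_cap: int = 2
    queues: dict = field(default_factory=dict)   # user -> deque of requests
    served: dict = field(default_factory=dict)   # user -> weighted service so far

    def queued(self) -> int:
        return sum(len(q) for q in self.queues.values())

    def submit(self, req: Request) -> bool:
        """Admit a request; reject it only if the system is overloaded
        and this user already has per_user_cap requests waiting."""
        queue = self.queues.setdefault(req.user, deque())
        self.served.setdefault(req.user, 0.0)
        overloaded = self.queued() >= self.overload_queue_len
        if overloaded and len(queue) >= self.per_user_cap:
            return False
        queue.append(req)
        return True

    def next_request(self):
        """Pick the oldest request of the waiting user with the smallest
        weighted service counter, then charge that user for it."""
        waiting = [u for u, q in self.queues.items() if q]
        if not waiting:
            return None
        user = min(waiting, key=lambda u: self.served[u])
        req = self.queues[user].popleft()
        self.served[user] += (self.input_weight * req.input_tokens
                              + self.output_weight * req.output_tokens)
        return req


# Usage: one heavy user floods the queue, one light user sends a single request.
sched = WeightedCounterScheduler()
results = [sched.submit(Request("heavy", 100, 100)) for _ in range(5)]
results.append(sched.submit(Request("light", 10, 10)))
order = [sched.next_request().user for _ in range(3)]
```

In this run the fifth request from the heavy user is rejected because the queue has reached the overload threshold, while the light user's request is still admitted and is served right after the heavy user's first request, since the light user's counter is lower. A real system would charge output tokens as they are generated rather than up front; charging at dispatch time is a simplification made here.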

Ensuring Fair LLM Serving Amid Diverse Applications
