Parrot: Efficient Serving of LLM-based Applications with Semantic Variable

Chaofan Lin,Zhenhua Han,Chengruidong Zhang,Yuqing Yang,Fan Yang,Chen Chen,Lili Qiu

2024-05-30

Abstract: The rise of large language models (LLMs) has enabled LLM-based applications (a.k.a. AI agents or co-pilots), a new software paradigm that combines the strength of LLM and conventional software. Diverse LLM applications from different tenants could design complex workflows using multiple LLM requests to accomplish one task. However, they have to use the over-simplified request-level API provided by today's public LLM services, losing essential application-level information. Public LLM services have to blindly optimize individual LLM requests, leading to sub-optimal end-to-end performance of LLM applications. This paper introduces Parrot, an LLM service system that focuses on the end-to-end experience of LLM-based applications. Parrot proposes Semantic Variable, a unified abstraction to expose application-level knowledge to public LLM services. A Semantic Variable annotates an input/output variable in the prompt of a request, and creates the data pipeline when connecting multiple LLM requests, providing a natural way to program LLM applications. Exposing Semantic Variables to the public LLM service allows it to perform conventional data flow analysis to uncover the correlation across multiple LLM requests. This correlation opens a brand-new optimization space for the end-to-end performance of LLM-based applications. Extensive evaluations demonstrate that Parrot can achieve up to an order-of-magnitude improvement for popular and practical use cases of LLM applications.

Machine Learning

What problem does this paper attempt to address?

The paper addresses the inefficiency of serving LLM-based applications through today's public LLM services. These services expose only a simple request-level API, so application-level information is lost and each request is optimized in isolation, which hurts end-to-end performance. The paper points out that consecutive LLM requests in one application may depend on each other, may have different scheduling preferences, and may repeat a large amount of computation.

To address this, the paper proposes Parrot, an LLM service system built around the Semantic Variable abstraction. A Semantic Variable annotates an input or output variable in the prompt of an LLM request, and connecting requests through these variables forms a data pipeline, which also makes programming LLM applications more natural. Because the variables are exposed to the public LLM service, the service can perform conventional data flow analysis, uncover the correlation between multiple LLM requests, and use it to optimize the application as a whole: reducing network latency, scheduling requests more efficiently, and eliminating redundant computation. Experiments show that Parrot can improve performance by up to an order of magnitude in popular and practical LLM application use cases.
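The data flow analysis described above can be illustrated with a minimal sketch. This is not Parrot's actual API: the `LLMRequest` class, the `{{name}}` placeholder syntax, and the helper functions are all hypothetical, chosen only to show how variable annotations let a service recover request dependencies and spot a shared prompt prefix.

```python
import os
import re
from dataclasses import dataclass
from graphlib import TopologicalSorter

# Hypothetical placeholder syntax for an input variable inside a prompt.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


@dataclass(frozen=True)
class LLMRequest:
    """One LLM call; an illustrative stand-in, not Parrot's real interface."""
    name: str
    prompt: str   # template whose {{var}} placeholders are input variables
    output: str   # name of the variable this request produces

    @property
    def inputs(self) -> list[str]:
        return PLACEHOLDER.findall(self.prompt)


def dependency_graph(requests):
    """Map each request to the requests that produce its input variables."""
    producer = {r.output: r.name for r in requests}
    return {
        r.name: {producer[v] for v in r.inputs if v in producer}
        for r in requests
    }


def execution_order(requests):
    """Order requests so that every producer runs before its consumers."""
    return list(TopologicalSorter(dependency_graph(requests)).static_order())


def shared_prefix(requests):
    """Longest common prefix of the static text before the first placeholder."""
    heads = [PLACEHOLDER.split(r.prompt, maxsplit=1)[0] for r in requests]
    return os.path.commonprefix(heads)


# Hypothetical two-step application: the second request consumes "code".
requests = [
    LLMRequest("write_tests",
               "You are an expert engineer. Write tests for this code: {{code}}",
               output="tests"),
    LLMRequest("write_code",
               "You are an expert engineer. Write Python code for: {{task}}",
               output="code"),
]

order = execution_order(requests)
prefix = shared_prefix(requests)
```

With only request-level information, a service would see two unrelated prompts. With the variable annotations, it can tell that `write_tests` must wait for `write_code`, and that both prompts start with the same text, which is the kind of correlation the paper proposes to exploit for scheduling and for avoiding redundant computation.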

Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models

Parrot: Multilingual Visual Instruction Tuning

Parsel: Algorithmic Reasoning with Language Models by Composing Decompositions

LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs

Deploying and Evaluating LLMs to Program Service Mobile Robots

Small LLMs Are Weak Tool Learners: A Multi-LLM Agent

Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System

LLMs as On-demand Customizable Service

Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security

LLM-based Optimization of Compound AI Systems: A Survey

FedML Parrot: A Scalable Federated Learning System via Heterogeneity-aware Scheduling on Sequential and Hierarchical Training

LLM-based Frameworks for Power Engineering from Routine to Novel Tasks

Training Language Model Agents without Modifying Language Models

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services

Leveraging Prior Experience: An Expandable Auxiliary Knowledge Base for Text-to-SQL

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Octopus: On-device language model for function calling of software APIs