XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, Tianqi Chen
2024-11-23
Abstract: The applications of LLM agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands. These developments bring significant demands for structured generation in LLM inference. Context-free grammar is a flexible approach to enable structured generation via constrained decoding. However, executing a context-free grammar requires going through several stack states over all tokens in the vocabulary at runtime, bringing non-negligible overhead for structured generation. In this paper, we propose XGrammar, a flexible and efficient structured generation engine for large language models. XGrammar accelerates context-free grammar execution by dividing the vocabulary into context-independent tokens that can be prechecked and context-dependent tokens that need to be interpreted at runtime. We further build transformations to expand the grammar context and reduce the number of context-dependent tokens. Additionally, we build an efficient persistent stack to accelerate the context-dependent token checks. Finally, we co-design the grammar engine with the LLM inference engine to overlap grammar computation with GPU execution. Evaluation results show that XGrammar can achieve up to 100x speedup over existing solutions. Combined with an LLM inference engine, it enables structured generation with near-zero overhead in end-to-end low-latency LLM serving.
Computation and Language, Artificial Intelligence, Programming Languages
What problem does this paper attempt to address?
This paper addresses the inefficiency of structured generation in large language models (LLMs). As LLM applications grow more complex and diverse, there is increasing demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands, and therefore for structured generation during LLM inference. Constrained decoding with a context-free grammar (CFG) is flexible, but at each decoding step it must check every token in the vocabulary against multiple stack states, which adds non-negligible runtime overhead. The paper proposes XGrammar, a flexible and efficient structured generation engine that accelerates CFG execution and reduces this overhead.

### Main Contributions

1. **Adaptive Token-Mask Cache**: The validity of context-independent tokens is precomputed and stored in a cache keyed by automaton position, which significantly reduces the cost of mask generation.
2. **Persistent Execution Stack**: Fast state branching and rollback speed up the processing of context-dependent tokens.
3. **Efficient Grammar Engine**: The engine is co-designed with the LLM serving framework so that grammar computation overlaps with GPU execution, minimizing the overhead of structured generation.

### Technical Details

- **Adaptive Token-Mask Cache**: Tokens are divided into context-independent and context-dependent tokens. The validity of a context-independent token depends only on the current automaton position, so it can be read directly from the cache at runtime. A context-dependent token also depends on the rest of the stack and must be checked dynamically at runtime. This split reduces the amount of computation at runtime.
- **Context Expansion**: Using context information derived from the grammar, more context-dependent tokens are rejected during preprocessing, further reducing the number of tokens that must be checked at runtime.
- **Persistent Execution Stack**: Multiple parallel stacks are managed in one shared structure that supports fast state branching and rollback, reducing memory redundancy and the cost of branching.
- **Pushdown Automaton Structure Optimization**: A rule-inlining strategy reduces the ambiguity of the grammar and improves the effect of context expansion.

### Experimental Results

The evaluation shows that XGrammar achieves a speedup of up to 100x over existing solutions. Combined with an LLM inference engine, it generates structured outputs with near-zero overhead in end-to-end low-latency LLM serving.

### Conclusion

By combining the adaptive token-mask cache, context expansion, the persistent execution stack, and co-design with the serving engine, XGrammar substantially reduces the cost of structured generation and supports complex LLM applications.
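The split between context-independent and context-dependent tokens described under Technical Details can be illustrated with a toy example. The sketch below is not XGrammar's implementation or API: it assumes a tiny bracket-matching grammar, treats the symbol on top of the stack as the "automaton position", and uses hypothetical names (`classify`, `build_cache`, `token_mask`, `accepts`). A token is context-independent when its validity is decided by the top of the stack alone; it is context-dependent when it closes more brackets than the position reveals.

```python
from enum import Enum

# Toy grammar (an assumption for illustration): well-nested () and [] brackets.
PAIRS = {")": "(", "]": "["}
OPENERS = set(PAIRS.values())


class Verdict(Enum):
    ACCEPT = 1       # valid for every stack with this top (context-independent)
    REJECT = 2       # invalid for every stack with this top (context-independent)
    NEEDS_STACK = 3  # depends on symbols below the top (context-dependent)


def classify(token, top):
    """Judge a token knowing only the top of the stack (None = empty stack)."""
    local = [top] if top is not None else []
    for ch in token:
        if ch in OPENERS:
            local.append(ch)
        elif ch in PAIRS:
            if not local:
                # The token reaches below what the position tells us.
                # With an empty stack there is nothing below, so it is invalid.
                return Verdict.REJECT if top is None else Verdict.NEEDS_STACK
            if local.pop() != PAIRS[ch]:
                return Verdict.REJECT
        else:
            return Verdict.REJECT
    return Verdict.ACCEPT


def accepts(token, stack):
    """Full runtime check of one token against the complete stack."""
    stack = list(stack)
    for ch in token:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            return False
    return True


def build_cache(vocab):
    """Precompute, per position, the accepted set and the context-dependent set."""
    cache = {}
    for top in (None, "(", "["):
        accepted, uncertain = set(), set()
        for tok in vocab:
            verdict = classify(tok, top)
            if verdict is Verdict.ACCEPT:
                accepted.add(tok)
            elif verdict is Verdict.NEEDS_STACK:
                uncertain.add(tok)
        cache[top] = (frozenset(accepted), frozenset(uncertain))
    return cache


def token_mask(cache, stack):
    """Allowed tokens: cached accepts plus context-dependent tokens that pass."""
    top = stack[-1] if stack else None
    accepted, uncertain = cache[top]
    return accepted | {tok for tok in uncertain if accepts(tok, stack)}


vocab = ["(", ")", "[", "]", "()", "))", ")]", "(]", "x"]
cache = build_cache(vocab)
print(sorted(token_mask(cache, ["[", "("])))
```

In this toy vocabulary, only `"))"` and `")]"` need a runtime check when the top of the stack is `"("`; every other token is settled during preprocessing. A real vocabulary is far larger, which is why keeping the context-dependent set small matters.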
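The persistent execution stack's branching and rollback can be sketched in a similar spirit. The code below is an illustrative assumption rather than the paper's data structure: stacks are stored as immutable nodes with parent pointers, so several stacks share their common prefix, branching costs a single node allocation, and rollback only moves a pointer. The names `Node`, `push`, `pop`, `to_list`, and `History` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    """One stack element; the stack is the chain of parents down to None."""
    value: str
    parent: Optional["Node"]


def push(top: Optional[Node], value: str) -> Node:
    # Creates one new node; the existing stack is shared, never copied.
    return Node(value, top)


def pop(top: Optional[Node]) -> Optional[Node]:
    if top is None:
        raise IndexError("pop from empty stack")
    # The popped node stays alive for any other branch that still uses it.
    return top.parent


def to_list(top: Optional[Node]) -> List[str]:
    """Materialize a stack bottom-to-top (for inspection only)."""
    items = []
    while top is not None:
        items.append(top.value)
        top = top.parent
    return items[::-1]


class History:
    """Remembers the stack top after each step so steps can be undone."""

    def __init__(self) -> None:
        self.tops: List[Optional[Node]] = [None]  # start with an empty stack

    @property
    def current(self) -> Optional[Node]:
        return self.tops[-1]

    def advance(self, new_top: Optional[Node]) -> None:
        self.tops.append(new_top)

    def rollback(self, steps: int) -> None:
        if steps < 0 or steps >= len(self.tops):
            raise ValueError("cannot roll back that many steps")
        del self.tops[len(self.tops) - steps:]


base = push(push(None, "["), "(")
branch_a = push(base, "(")   # one branch opens another bracket
branch_b = pop(base)         # another branch closes the inner bracket
print(to_list(base), to_list(branch_a), to_list(branch_b))
```

Because nodes are immutable, `branch_a` and `branch_b` both reuse the nodes of `base` instead of copying them, which is the property that reduces memory redundancy when many candidate tokens must be checked from the same state.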