Abstract: The applications of LLM Agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands. These developments bring significant demands for structured generation in LLM inference. Context-free grammar is a flexible approach to enable structured generation via constrained decoding. However, executing context-free grammar requires going through several stack states over all tokens in the vocabulary during runtime, bringing non-negligible overhead for structured generation. In this paper, we propose XGrammar, a flexible and efficient structured generation engine for large language models. XGrammar accelerates context-free grammar execution by dividing the vocabulary into context-independent tokens that can be prechecked and context-dependent tokens that need to be interpreted during runtime. We further build transformations to expand the grammar context and reduce the number of context-dependent tokens. Additionally, we build an efficient persistent stack to accelerate the context-dependent token checks. Finally, we co-design the grammar engine with the LLM inference engine to overlap grammar computation with GPU executions. Evaluation results show that XGrammar can achieve up to 100x speedup over existing solutions. Combined with an LLM inference engine, it can achieve near-zero-overhead structured generation in end-to-end low-latency LLM serving.

What problem does this paper attempt to address?

The paper addresses the inefficiency of structured generation in large language models (LLMs). As LLM applications grow more complex and diverse, so does the demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands, which in turn raises the demand for structured generation during LLM inference. Constrained decoding with a context-free grammar (CFG) is flexible, but at runtime it must check every token in the vocabulary against multiple stack states, which adds non-negligible overhead. The paper therefore proposes XGrammar, a flexible and efficient structured generation engine that accelerates CFG execution and reduces this overhead.

### Main Contributions

1. **Adaptive Token-Mask Cache**: The validity of context-independent tokens is precomputed and stored in a cache keyed by automaton position, which significantly reduces the overhead of mask generation.
2. **Persistent Execution Stack**: Fast state branching and rollback operations speed up the checking of context-dependent tokens.
3. **Efficient Grammar Engine**: The engine is co-designed with the LLM serving framework so that grammar computation overlaps with GPU execution, minimizing the overhead of structured generation.

### Technical Details

- **Adaptive Token-Mask Cache**: Tokens are divided into context-independent and context-dependent tokens. The validity of context-independent tokens is read directly from the cache at runtime, while context-dependent tokens must be checked dynamically against the full stack. This split reduces the amount of computation at runtime.
- **Context Expansion**: Using the context information of the grammar, more context-dependent tokens are rejected in the preprocessing stage, which further reduces the number of tokens that need to be checked at runtime.
- **Persistent Execution Stack**: Multiple parallel stacks are managed in one structure that supports fast state branching and rollback, reducing memory redundancy and the overhead of state branching.
- **Pushdown Automaton Structure Optimization**: A rule inlining strategy reduces the ambiguity of the grammar and improves the effect of context expansion.

### Experimental Results

The evaluation shows that XGrammar achieves a speedup of up to 100x over existing solutions. Combined with an LLM inference engine, it generates structured outputs with almost zero overhead in end-to-end low-latency LLM serving.

### Conclusion

XGrammar significantly improves the efficiency of structured generation in large language models through the techniques above, providing strong support for complex LLM applications.
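The split between context-independent and context-dependent tokens in the adaptive token-mask cache can be illustrated with a toy example. The sketch below is not XGrammar's API: it assumes a hypothetical grammar of balanced square brackets, a made-up eight-token vocabulary, and a single automaton position ("inside at least one open bracket"). All function names are illustrative.

```python
from typing import List, Tuple

# Hypothetical toy vocabulary; real tokenizers have tens of thousands of tokens.
VOCAB: List[str] = ["[", "]", "[]", "[[", "]]", "]]]", "a", "[a"]


def classify(token: str) -> str:
    """Classify a token while seeing only the innermost open bracket."""
    if any(ch not in "[]" for ch in token):
        return "rejected"  # invalid regardless of the rest of the stack
    depth = 1  # only the top of the stack is known at preprocessing time
    for ch in token:
        depth += 1 if ch == "[" else -1
        if depth < 0:
            return "dependent"  # closes brackets owned by unseen parent frames
    return "accepted"  # valid regardless of the rest of the stack


def build_cache(vocab: List[str]) -> Tuple[List[bool], List[int]]:
    """Preprocessing: store the accepted mask and the indices left for runtime."""
    kinds = [classify(t) for t in vocab]
    accepted = [k == "accepted" for k in kinds]
    dependent = [i for i, k in enumerate(kinds) if k == "dependent"]
    return accepted, dependent


def runtime_check(token: str, depth: int) -> bool:
    """Runtime: check a bracket-only token against the full stack depth."""
    for ch in token:
        depth += 1 if ch == "[" else -1
        if depth < 0:
            return False
    return True


def token_mask(cache: Tuple[List[bool], List[int]],
               vocab: List[str], depth: int) -> List[bool]:
    """Combine the cached mask with runtime checks of context-dependent tokens."""
    accepted, dependent = cache
    mask = list(accepted)
    for i in dependent:  # only these tokens cost anything at runtime
        mask[i] = runtime_check(vocab[i], depth)
    return mask


if __name__ == "__main__":
    cache = build_cache(VOCAB)
    for depth in (1, 2, 3):
        print(depth, token_mask(cache, VOCAB, depth))
```

In this toy setting only `"]]"` and `"]]]"` are context-dependent, so each runtime mask needs two stack checks instead of eight.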
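The persistent execution stack's branching and rollback can be sketched as a tree of immutable nodes with parent pointers. This is a minimal illustration under assumptions, not the paper's implementation: the class and method names (`Node`, `PersistentStack`, `fork`, `rollback`) are hypothetical. Branches share their common prefix, and rollback only discards recent stack tops from a history list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    """Immutable stack frame; many stacks may share the same parent chain."""
    value: str
    parent: Optional["Node"]


class PersistentStack:
    """Hypothetical persistent stack with cheap fork and rollback."""

    def __init__(self, top: Optional[Node] = None) -> None:
        self._history: List[Optional[Node]] = [top]  # one entry per state

    @property
    def top(self) -> Optional[Node]:
        return self._history[-1]

    def push(self, value: str) -> None:
        # A new node points at the old top; nothing is copied or mutated.
        self._history.append(Node(value, self.top))

    def pop(self) -> str:
        if self.top is None:
            raise IndexError("pop from empty stack")
        value = self.top.value
        self._history.append(self.top.parent)
        return value

    def rollback(self, steps: int) -> None:
        """Undo the last `steps` push/pop operations."""
        if steps < 0 or steps >= len(self._history):
            raise ValueError("cannot roll back that many steps")
        if steps:
            del self._history[-steps:]

    def fork(self) -> "PersistentStack":
        """Branch the state; both stacks share every existing node."""
        return PersistentStack(self.top)

    def snapshot(self) -> List[str]:
        """Return the frames from bottom to top."""
        items: List[str] = []
        node = self.top
        while node is not None:
            items.append(node.value)
            node = node.parent
        return items[::-1]


if __name__ == "__main__":
    main = PersistentStack()
    main.push("array")
    main.push("object")
    branch = main.fork()   # no frames are copied
    branch.push("string")
    main.pop()
    main.rollback(1)       # undo the pop
    print(main.snapshot(), branch.snapshot())
```

Because nodes are never mutated, forking copies no frames and rollback only shortens the history list, which is the property the summary credits for reducing memory redundancy and branching overhead.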
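The co-design with the inference engine (overlapping grammar computation with GPU execution) can be sketched as a scheduling pattern. Everything here is a stand-in: `fake_forward` replaces a real GPU forward pass, `compute_mask` replaces the grammar engine, and the thread pool models CPU work running alongside the device. It shows only the order of operations, not real performance.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import List, Set


def compute_mask(allowed: Set[int], vocab_size: int) -> List[bool]:
    """Stand-in for CPU-side grammar mask generation."""
    return [i in allowed for i in range(vocab_size)]


def fake_forward(step: int, vocab_size: int) -> List[float]:
    """Stand-in for the GPU forward pass; returns deterministic logits."""
    return [float((i * 7 + step * 3) % vocab_size) for i in range(vocab_size)]


def decode_step(executor: ThreadPoolExecutor, allowed: Set[int],
                step: int, vocab_size: int) -> int:
    # Launch mask generation first so it runs while the forward pass executes.
    mask_future = executor.submit(compute_mask, allowed, vocab_size)
    logits = fake_forward(step, vocab_size)
    mask = mask_future.result()  # synchronize only right before sampling
    masked = [x if ok else -math.inf for x, ok in zip(logits, mask)]
    return max(range(vocab_size), key=masked.__getitem__)  # greedy sampling


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        print(decode_step(pool, allowed={2, 5}, step=0, vocab_size=10))
```

The design point is that the mask is needed only at sampling time, so its computation can be hidden behind the forward pass as long as it finishes first.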

XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models
