Abstract: While large language models (LLMs) bring not only performance but also complexity, recent work has started to turn LLMs into data generators rather than task inferencers, where another affordable task model is trained for efficient deployment and inference. However, such an approach has primarily been applied to natural language tasks and has not yet been explored for symbolic language tasks with complex structured outputs (e.g., semantic parsing and code generation). In this paper, we propose SymGen, which utilizes LLMs to generate various annotation-expensive symbolic language data. SymGen consists of an informative prompt to steer generation and an agreement-based verifier to improve data correctness. We conduct extensive experiments on six symbolic language tasks across various settings. Compared with the LLMs, we demonstrate that a 1%-sized task model can achieve comparable or better performance, largely cutting inference and deployment costs. We also show that generated data with only a few human demonstrations can be as effective as over 10 times the amount of human-annotated data when training the task model, saving a considerable amount of annotation effort. SymGen sheds new light on data generation for complex tasks, and we release the code at https://github.com/HKUNLP/SymGen.

Generating Succinct Descriptions of Database Schemata for Cost-Efficient Prompting of Large Language Models

Concise and Organized Perception Facilitates Large Language Models for Deductive Reasoning

Schema Matching with Large Language Models: an Experimental Study

SQL-to-Schema Enhances Schema Linking in Text-to-SQL

CHESS: Contextual Harnessing for Efficient SQL Synthesis

SA-SQL: A Schema-Aligned Framework for Text-to-SQL Through Large Language Models

RSL-SQL: Robust Schema Linking in Text-to-SQL Generation

The Death of Schema Linking? Text-to-SQL in the Age of Well-Reasoned Language Models

RH-SQL: Refined Schema and Hardness Prompt for Text-to-SQL

Improving Retrieval-augmented Text-to-SQL with AST-based Ranking and Schema Pruning

MCS-SQL: Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to-SQL Generation

Generating Data for Symbolic Language with Large Language Models

Conceptual Schema Optimisation -- Database Optimisation before sliding down the Waterfall

Evaluating LLMs for Text-to-SQL Generation With Complex SQL Workload

Synthetic SQL Column Descriptions and Their Impact on Text-to-SQL Performance

Structsum Generation for Faster Text Comprehension

Algebraic Meta-structure Handling of Huge Database Schemata

Prompt Sketching for Large Language Models

Towards Agentic Schema Refinement

Magneto: Combining Small and Large Language Models for Schema Matching

Matchmaker: Self-Improving Large Language Model Programs for Schema Matching