On The Role of Prompt Construction In Enhancing Efficacy and Efficiency of LLM-Based Tabular Data Generation

Banooqa Banday, Kowshik Thopalli, Tanzima Z. Islam, Jayaraman J. Thiagarajan
2024-09-06
Abstract: LLM-based data generation for real-world tabular data can be challenged by the lack of sufficient semantic context in feature names used to describe columns. We hypothesize that enriching prompts with domain-specific insights can improve both the quality and efficiency of data generation. To test this hypothesis, we explore three prompt construction protocols: Expert-guided, LLM-guided, and Novel-Mapping. Through empirical studies with the recently proposed GReaT framework, we find that context-enriched prompts lead to significantly improved data generation quality and training efficiency.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the poor quality and efficiency of LLM-based generation of real-world tabular data when feature names lack sufficient semantic context. Specifically, feature names in many real tabular datasets are ambiguous, contain abbreviations or symbols that are hard to interpret, or are merely generic labels, any of which can degrade the output of generators based on large language models (LLMs). The authors therefore hypothesize that adding domain-specific knowledge to the prompt can improve both the quality of the generated tabular data and the training efficiency. To test this hypothesis, the paper explores three prompt construction protocols:

1. **Expert-guided**: Domain experts provide detailed feature descriptions to enrich the prompt.
2. **LLM-guided**: An external LLM automatically generates feature descriptions from the given feature names and the dataset name.
3. **Novel-Mapping**: An external LLM maps generic feature names to meaningful features in a new domain (such as physics or life sciences) according to their value ranges.

Through experiments on multiple datasets, the paper shows that these context-rich prompt strategies improve the quality of the generated data and significantly improve training efficiency, especially with parameter-efficient fine-tuning methods such as LoRA. In addition, when feature names are entirely generic and carry no relevant context, the Novel-Mapping strategy also yields significant improvements.
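The core idea of context-enriched prompts can be illustrated with a minimal sketch. It assumes each table row is serialized into text as comma-separated "name is value" clauses, and that enrichment simply swaps a terse column name for a longer description. Both the serialization format and the helper `serialize_row` are assumptions made for illustration, not details given in this summary, and the column names and descriptions below are hypothetical rather than taken from the paper's datasets.

```python
from typing import Mapping, Optional


def serialize_row(
    row: Mapping[str, object],
    descriptions: Optional[Mapping[str, str]] = None,
) -> str:
    """Turn one table row into a text prompt of 'name is value' clauses.

    If `descriptions` maps a column name to a richer description, the
    description is used in place of the raw name; columns without an
    entry fall back to their original name.
    """
    descriptions = descriptions or {}
    clauses = []
    for name, value in row.items():
        label = descriptions.get(name, name)
        clauses.append(f"{label} is {value}")
    return ", ".join(clauses)


# Hypothetical row with terse, hard-to-interpret column names.
row = {"bp": 120, "chol": 233}

# Hypothetical descriptions, as an expert or an external LLM might supply.
descriptions = {
    "bp": "resting blood pressure (mm Hg)",
    "chol": "serum cholesterol (mg/dl)",
}

baseline_prompt = serialize_row(row)
enriched_prompt = serialize_row(row, descriptions)

print(baseline_prompt)  # bp is 120, chol is 233
print(enriched_prompt)  # resting blood pressure (mm Hg) is 120, serum cholesterol (mg/dl) is 233
```

Under this sketch, the three protocols differ only in where the `descriptions` mapping comes from: a domain expert, an external LLM given the feature and dataset names, or an external LLM that assigns generic columns to features of another domain based on their value ranges.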