UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models

Siyuan Wu, Yue Huang, Chujie Gao, Dongping Chen, Qihui Zhang, Yao Wan, Tianyi Zhou, Xiangliang Zhang, Jianfeng Gao, Chaowei Xiao, Lichao Sun

2024-08-23

Abstract: Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Despite this, challenges remain in the areas of generalization, controllability, diversity, and truthfulness within the existing generative frameworks. To address these challenges, this paper presents UniGen, a comprehensive LLM-powered framework designed to produce diverse, accurate, and highly controllable datasets. UniGen is adaptable, supporting all types of text datasets and enhancing the generative process through innovative mechanisms. To augment data diversity, UniGen incorporates an attribute-guided generation module and a group checking feature. For accuracy, it employs a code-based mathematical assessment for label verification alongside a retrieval-augmented generation technique for factual validation. The framework also allows for user-specified constraints, enabling customization of the data generation process to suit particular requirements. Extensive experiments demonstrate the superior quality of data generated by UniGen, and each module within UniGen plays a critical role in this enhancement. Additionally, UniGen is applied in two practical scenarios: benchmarking LLMs and data augmentation. The results indicate that UniGen effectively supports dynamic and evolving benchmarking, and that data augmentation improves LLM capabilities in various domains, including agent-oriented abilities and reasoning skills.

Computation and Language

What problem does this paper attempt to address?

The paper targets several key challenges in generating text datasets with large language models (LLMs):

1. **Generalization and Controllability**: Existing generation frameworks typically modify items from an original dataset according to fixed rules, which limits how well the generated data generalizes. They are also often restricted to specific dataset formats or types, such as multiple-choice questions or math-oriented datasets. In addition, they lack a mechanism for incorporating external constraints, such as a user-specified length for the generated text, which limits the controllability of the generation process.
2. **Diversity and Truthfulness**: Previous attempts often overlooked dataset quality requirements such as diversity and truthfulness. Applying LLMs directly to dataset generation can produce repetitive, low-diversity data, because LLMs may return the same answer for semantically similar inputs. The tendency of LLMs to hallucinate can also introduce factual errors, which can degrade models trained or fine-tuned on such data.

To address these challenges, the paper proposes **UniGen**, a unified, LLM-driven framework for generating high-quality, diverse, accurate, and highly controllable datasets. UniGen improves data diversity through an attribute-guided generation module and a group-checking function; it supports truthfulness through code-based mathematical evaluation for label verification and retrieval-augmented generation for fact verification; and it lets users specify constraints so that the generation process meets specific requirements.
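To make three of these mechanisms concrete, the following is a minimal sketch of what post-generation checks of this kind might look like. It is not UniGen's implementation: every function name here is hypothetical, plain string similarity stands in for whatever similarity measure the group-checking step actually uses, and the label check covers only simple arithmetic expressions.

```python
import ast
import operator
from difflib import SequenceMatcher

# All names below are hypothetical; none are taken from the UniGen codebase.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def eval_arithmetic(expr):
    """Evaluate a +, -, *, / expression by walking its AST (no eval())."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval").body)


def label_is_verified(expr, label, tol=1e-9):
    """Code-based label check: recompute the answer and compare to the label."""
    return abs(eval_arithmetic(expr) - label) <= tol


def within_length(text, max_words):
    """User-specified constraint (assumed here to be a word limit)."""
    return len(text.split()) <= max_words


def drop_near_duplicates(texts, threshold=0.9):
    """Group check stand-in: keep a text only if it is not too similar
    to any text already kept."""
    kept = []
    for text in texts:
        if all(
            SequenceMatcher(None, text.lower(), other.lower()).ratio() < threshold
            for other in kept
        ):
            kept.append(text)
    return kept


candidates = [
    "What is 12 * (3 + 4)?",
    "What is 12 * (3 + 4) ?",       # near-duplicate of the first item
    "Name the capital of France.",
]
unique = drop_near_duplicates(candidates)
```

In this sketch, `unique` keeps the first and third candidates, and `label_is_verified("10 / 4", 3)` rejects the incorrect label. A real pipeline would likely use embedding-based similarity and a more general code executor; those choices are outside what the summary specifies.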
