Abstract: Recent advancements in large language models (LLMs) have significantly enhanced their knowledge and generative capabilities, leading to a surge of interest in leveraging LLMs for high-quality data synthesis. However, synthetic data generation via prompting LLMs remains challenging due to LLMs' limited understanding of target data distributions and the complexity of prompt engineering, especially for structured formatted data. To address these issues, we introduce DiffLM, a controllable data synthesis framework based on a variational autoencoder (VAE), which further (1) leverages diffusion models to preserve more information about the original distribution and format structure in the learned latent distribution and (2) decouples the learning of target distribution knowledge from the LLM's generative objectives via a plug-and-play latent feature injection module. Because we observed significant discrepancies between the VAE's latent representations and the real data distribution, a latent diffusion module is introduced into our framework to learn a fully expressive latent distribution. Evaluations on seven real-world datasets with structured formatted data (i.e., tabular, code, and tool data) demonstrate that DiffLM generates high-quality data, with performance on downstream tasks surpassing that of real data by 2-7 percent in certain cases. The data and code will be publicly available upon completion of internal review.
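
To make the role of the latent diffusion module concrete, the block below writes out the textbook VAE objective and a standard noise-prediction diffusion loss applied to the latents. This is a sketch under assumptions rather than the paper's exact formulation: the symbols $q_\phi$, $p_\theta$, $\epsilon_\psi$, the weight $\beta$, and the schedule $\bar{\alpha}_t$ are generic notation not taken from the abstract, and the authors' losses may differ in detail.

```latex
% Standard (beta-weighted) VAE loss with prior p(z) = N(0, I):
\mathcal{L}_{\text{VAE}}(x)
  = -\,\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
    + \beta\,\mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)

% The aggregated posterior over the dataset generally differs from the prior:
q_\phi(z) = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[q_\phi(z \mid x)\right] \neq p(z)

% A latent diffusion model can instead be fit to q_\phi(z) by noise prediction:
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L}_{\text{diff}}
  = \mathbb{E}_{z_0 \sim q_\phi(z),\, \epsilon,\, t}
    \left\| \epsilon - \epsilon_\psi(z_t, t) \right\|^2
```

Under this reading, sampling latents from the diffusion model rather than from $\mathcal{N}(0, I)$ is what lets the decoder receive latents that resemble those seen during training.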

What problem does this paper attempt to address?

This paper addresses the challenges of generating high-quality synthetic data with large language models (LLMs), especially for structured data. Specifically, it identifies two problems with current LLM-based approaches:

1. **Limited understanding of the target data distribution**: Even after fine-tuning, it is difficult to inject information about complex and variable data distributions into LLMs, which results in low-diversity output and a tendency to replicate existing data.
2. **Complex prompt engineering**: Existing LLM-based synthetic data generation methods usually involve complex pipelines and post-processing mechanisms, such as prompt engineering, multi-agent frameworks, and iterative sampling. This complexity impedes the rapid adaptation of LLMs to new tasks and limits their practicality in dynamic research and industrial scenarios.

To overcome these problems, the paper proposes a new framework named DiffLM, which contributes the following:

- **Decoupled data distribution learning**: A small projection network is introduced so that LLMs can learn the real data distribution from external information without affecting their training objectives.
- **High-quality synthetic data generation**: A carefully designed variational autoencoder (VAE) and diffusion model structure models the distribution of real data and generates high-quality synthetic data.
- **Comprehensive evaluation**: The quality of data generated by DiffLM is verified in three scenarios (tabular, code, and tool data) across seven datasets, demonstrating its robustness and adaptability in synthetic data generation.

In summary, the main objective of the paper is to improve the quality and controllability of structured synthetic data generated by LLMs by combining a VAE with diffusion models.
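
The decoupling idea described above can be illustrated with a minimal NumPy sketch. It assumes, purely for illustration, that the latent is injected as a soft-prompt prefix: a projection maps a VAE latent `z` to a few embedding vectors that are prepended to a frozen LLM's token embeddings. The names `LatentInjector`, `inject`, and all dimensions are hypothetical, and the paper's actual injection module may work differently.

```python
import numpy as np


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (VAE reparameterization)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), computed per example."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)


class LatentInjector:
    """Hypothetical projection network: latent z -> k soft-prompt embeddings."""

    def __init__(self, latent_dim, hidden_dim, num_soft_tokens, rng):
        self.k = num_soft_tokens
        self.hidden_dim = hidden_dim
        # A single linear layer stands in for the "small projection network".
        self.W = rng.standard_normal((latent_dim, num_soft_tokens * hidden_dim)) * 0.02
        self.b = np.zeros(num_soft_tokens * hidden_dim)

    def __call__(self, z):
        out = z @ self.W + self.b
        return out.reshape(z.shape[0], self.k, self.hidden_dim)


def inject(soft_prompt, token_embeddings):
    """Prepend soft-prompt vectors to the (frozen) LLM's token embeddings."""
    return np.concatenate([soft_prompt, token_embeddings], axis=1)


rng = np.random.default_rng(0)
batch, latent_dim, hidden_dim, k, seq_len = 2, 8, 16, 4, 5

# Stand-ins for encoder outputs; in a real system these come from the VAE encoder
# at training time, or from the latent diffusion sampler at generation time.
mu = rng.standard_normal((batch, latent_dim))
logvar = np.zeros((batch, latent_dim))
z = reparameterize(mu, logvar, rng)

injector = LatentInjector(latent_dim, hidden_dim, k, rng)
token_embeddings = rng.standard_normal((batch, seq_len, hidden_dim))
decoder_inputs = inject(injector(z), token_embeddings)
print(decoder_inputs.shape)  # (2, 9, 16)
```

Only the projection parameters (`W`, `b` here) would be trained in such a setup, which is one way to read "without affecting their training objectives": the LLM's weights and its next-token loss stay unchanged while the distribution knowledge enters through the injected vectors.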

DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models

Multimodal Latent Language Modeling with Next-Token Diffusion

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

Segment-Level Diffusion: A Framework for Controllable Long-Form Generation with Diffusion Language Models

Diffusion-LM Improves Controllable Text Generation

LDSeq: Latent Diffusion Models for Sequence to Sequence Text Generation

Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models

LLM-grounded Video Diffusion Models

Latent Diffusion for Language Generation

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

TimeLDM: Latent Diffusion Model for Unconditional Time Series Generation

DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents

SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models

A Cheaper and Better Diffusion Language Model with Soft-Masked Noise

Chunk-Distilled Language Modeling

Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration

Quantized Embedding Vectors for Controllable Diffusion Language Models

DiffuseVAE: Efficient, Controllable and High-Fidelity Generation from Low-Dimensional Latents

Decoder-Only LLMs Are Better Controllers for Diffusion Models

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models