Abstract: Presenting users with diverse responses from foundation models is crucial for enhancing user experience and accommodating varying preferences. However, generating multiple high-quality and diverse responses without sacrificing accuracy remains a challenge, especially when using greedy sampling. In this work, we propose a novel framework, Synthesize-Partition-Adapt (SPA), that leverages the abundant synthetic data available in many domains to elicit diverse responses from foundation models. By leveraging signal provided by data attribution methods such as influence functions, SPA partitions data into subsets, each targeting unique aspects of the data, and trains multiple model adaptations optimized for these subsets. Experimental results demonstrate the effectiveness of our approach in diversifying foundation model responses while maintaining high quality, showcased through the HumanEval and MBPP tasks in the code generation domain and several tasks in the natural language understanding domain, highlighting its potential to enrich user experience across various applications.

What problem does this paper attempt to address?

This paper addresses how to generate diverse, high-quality responses from foundation models in order to improve user experience and accommodate different user preferences. The authors point out that producing multiple high-quality and diverse responses remains a challenge, especially when using greedy sampling. To address this, they propose a new framework, Synthesize-Partition-Adapt (SPA), which uses abundant synthetic data to elicit more diverse outputs from a foundation model.

### Main Problems

1. **Balance between Diversity and Quality**: Traditional methods for increasing diversity (such as temperature sampling) often trade response quality for diversity. Under greedy sampling in particular, it is difficult to obtain both diversity and accuracy.
2. **Effective Utilization of Large-scale Synthetic Data**: As synthetic datasets grow, a key question is how to use them to train multiple model adaptations that produce diverse responses.

### Solutions

The core ideas of the SPA framework are:

- **Synthesize**: Use existing synthetic datasets, which can be generated in various ways, such as data augmentation, back-translation, and other techniques.
- **Partition**: Use data attribution methods (such as influence functions) to divide the synthetic dataset into multiple subsets, each targeting a different aspect of the data.
- **Adapt**: Apply parameter-efficient fine-tuning (such as LoRA) to each subset, training multiple model adaptations so that each adapted model focuses on a specific data subset and the adaptations together generate diverse responses.

### Experimental Verification

The authors evaluate SPA on code generation tasks (HumanEval and MBPP) and on several natural language understanding tasks. The reported results show that SPA increases the diversity of generated responses while maintaining high quality, which the authors present as a way to enrich user experience.

### Formula Presentation

Some of the formulas involved in the paper are as follows:

- **Calculation of Influence Function**:

\[ I((x_i, y_i), (x_t^{(m)}, y_t^{(m)})) = -\nabla_\theta \ell(y_t^{(m)}, M(x_t^{(m)}; \hat{\theta}))^\top H^{-1}_{\hat{\theta}} \nabla_\theta \ell(y_i, M(x_i; \hat{\theta})) \]

where \(H_{\hat{\theta}}\) is the Hessian matrix of the loss function with respect to the model parameters, and \(\ell\) is an appropriate loss function.

- **Average KL Divergence**:

\[ \text{Average KL Divergence} = \frac{1}{\binom{K}{2}} \sum_{i = 1}^{K - 1} \sum_{j = i + 1}^{K} D_{KL}(P_i \| P_j) \]

where \(K\) is the number of model adaptations and \(D_{KL}(P_i \| P_j)\) is the KL divergence between the probability distributions of the responses generated by model adaptations \(i\) and \(j\).

With these components, the authors report that SPA addresses the challenge of generating diverse, high-quality responses from foundation models, with the aim of giving users a richer and more personalized experience.
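The influence-function and average-KL formulas in the summary can be made concrete with a small sketch. This is not the authors' implementation; it is a minimal illustration under several assumptions that do not come from the paper: the model is a toy linear regressor with squared loss, the Hessian is that of the average training loss with an added damping term, and each training example is assigned to the test query for which its influence is most negative (a hypothetical partition rule). All function names are illustrative, and only the Python standard library is used.

```python
import math
from itertools import combinations

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve A x = b for a small dense system (Gaussian elimination, partial pivoting)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def grad(theta, x, y):
    """Gradient of the toy loss 0.5 * (theta . x - y)^2 with respect to theta."""
    r = dot(theta, x) - y
    return [r * xi for xi in x]

def damped_hessian(train, dim, damping):
    """Hessian of the average toy loss, mean of x x^T, plus damping * I (assumption)."""
    H = [[damping if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for x, _ in train:
        for i in range(dim):
            for j in range(dim):
                H[i][j] += x[i] * x[j] / len(train)
    return H

def influence(theta, H, train_pt, test_pt):
    """I = -grad_test^T H^{-1} grad_train, matching the formula in the summary."""
    h_inv_g = solve(H, grad(theta, *train_pt))
    return -dot(grad(theta, *test_pt), h_inv_g)

def partition(theta, train, tests, damping=1e-3):
    """Hypothetical rule: put each training index in the group of the test query
    whose loss it reduces most (most negative influence)."""
    H = damped_hessian(train, len(theta), damping)
    groups = [[] for _ in tests]
    for i, z in enumerate(train):
        scores = [influence(theta, H, z, t) for t in tests]
        groups[scores.index(min(scores))].append(i)
    return groups

def kl(p, q):
    """KL(p || q) for discrete distributions on a shared support (q > 0 where p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def average_pairwise_kl(dists):
    """Average of KL(P_i || P_j) over all pairs i < j; needs at least two distributions."""
    k = len(dists)
    total = sum(kl(dists[i], dists[j]) for i, j in combinations(range(k), 2))
    return total / math.comb(k, 2)
```

As a usage example, calling `partition([0.0, 0.0], train, tests)` with `train = tests = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0)]` places each training example in the group of the query that shares its input direction. In a real setting the per-adaptation distributions passed to `average_pairwise_kl` would come from the fine-tuned adaptations, and the Hessian-inverse product would need an approximation rather than an exact solve.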

Synthesize, Partition, then Adapt: Eliciting Diverse Samples from Foundation Models
