Abstract: Post-training is essential for enabling large language models (LLMs) to follow human instructions. Inspired by the recent success of using LLMs to simulate human society, we leverage multi-agent simulation to automatically generate diverse text-based scenarios, capturing a wide range of real-world human needs. We propose MATRIX, a multi-agent simulator that creates realistic and scalable scenarios. Leveraging these outputs, we introduce a novel scenario-driven instruction generator MATRIX-Gen for controllable and highly realistic data synthesis. Extensive experiments demonstrate that our framework effectively generates both general and domain-specific data. Notably, on AlpacaEval 2 and Arena-Hard benchmarks, Llama-3-8B-Base, post-trained on datasets synthesized by MATRIX-Gen with just 20K instruction-response pairs, outperforms Meta's Llama-3-8B-Instruct model, which was trained on over 10M pairs; see our project at https://github.com/ShuoTang123/MATRIX-Gen.

What problem does this paper attempt to address?

This paper addresses a challenge in the post-training of large language models (LLMs): how to generate high-quality instruction data that reflects real-world user needs. Specifically:

1. **Challenges in data acquisition**: Obtaining high-quality instruction data from real users is difficult because of privacy concerns, data scarcity, and high labor costs.
2. **Limitations of existing methods**: Existing data synthesis methods usually rely on aligned LLMs to generate new instructions. These methods are efficient, but they cannot explicitly incorporate real-world user needs into the synthesis process. They also depend heavily on manually designed, predefined prompts, which raises the risk of producing unrealistic instructions that do not match actual user needs and makes it harder to control generation toward specific kinds of data.

To address these problems, the paper proposes a framework based on multi-agent simulation that automatically generates diverse text-based scenarios and captures a wide range of real-world human needs. Its main contributions are:

- **Introducing multi-agent simulation**: The paper presents this as the first application of multi-agent simulation to post-training data synthesis for LLMs. Simulating diverse and highly realistic social scenarios improves the realism of the synthesized data and provides the controllability needed to generate specific, high-quality data.
- **Proposing a new post-training data synthesis framework**: The framework combines a multi-agent social simulator (MATRIX) with a scenario-driven instruction generator (MATRIX-Gen). Using the diverse and realistic scenarios produced by the simulator, it synthesizes high-quality, realistic post-training data for a variety of settings.
- **Extensive experimental evaluation**: Extensive experiments verify the effectiveness of the framework. In particular, on the AlpacaEval 2 and Arena-Hard benchmarks, the Llama-3-8B-Base model post-trained on 20,000 synthesized instruction-response pairs outperforms Meta's Llama-3-8B-Instruct model, which was post-trained on more than 10 million pairs. The paper also evaluates areas such as general problem solving, multi-turn dialogue, coding accuracy, and safety.

In conclusion, the paper aims to improve LLM post-training through a new data synthesis method, so that models understand and follow human instructions more effectively.
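
The two-stage data flow described above (a simulator that produces scenarios, then a generator that turns each scenario into an instruction) can be illustrated with a toy sketch. This is not the authors' implementation: the names `Agent`, `Scenario`, `simulate_scenarios`, and `build_instruction_prompt` are hypothetical, and string templates stand in for the LLM calls that a real system would make.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Agent:
    # Hypothetical agent profile; a real simulator would use far richer profiles.
    name: str
    occupation: str
    goal: str


@dataclass
class Scenario:
    description: str
    participants: List[str]


def simulate_scenarios(agents: List[Agent], rounds: int, seed: int = 0) -> List[Scenario]:
    """Toy stand-in for a multi-agent simulator.

    Each round pairs two distinct agents at random and records a templated
    interaction. An LLM would normally role-play the agents instead.
    """
    rng = random.Random(seed)
    scenarios = []
    for _ in range(rounds):
        asker, helper = rng.sample(agents, 2)
        description = (
            f"{asker.name}, a {asker.occupation}, asks {helper.name}, "
            f"a {helper.occupation}, for help to {asker.goal}."
        )
        scenarios.append(Scenario(description, [asker.name, helper.name]))
    return scenarios


def build_instruction_prompt(scenario: Scenario, domain: str = "general") -> str:
    """Toy stand-in for a scenario-driven instruction generator.

    It only builds the prompt that would be sent to an LLM; the `domain`
    tag illustrates how generation could be steered toward specific data.
    """
    return (
        f"[{domain}] Situation: {scenario.description} "
        "Write the request the first person might send to an AI assistant."
    )


if __name__ == "__main__":
    agents = [
        Agent("Ana", "nurse", "plan a night-shift schedule"),
        Agent("Ben", "teacher", "grade essays faster"),
        Agent("Chen", "developer", "debug a memory leak"),
    ]
    for scenario in simulate_scenarios(agents, rounds=3, seed=1):
        print(build_instruction_prompt(scenario, domain="general"))
```

Keeping the scenario as an explicit intermediate object is what allows the second stage to be controlled separately, for example by changing the `domain` tag without rerunning the simulation.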

Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation

Instruction Pre-Training: Language Models are Supervised Multitask Learners

AgentInstruct: Toward Generative Teaching with Agentic Flows

Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning

What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation

Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning

Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

GenSim: Generating Robotic Simulation Tasks via Large Language Models

CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small Language Models

TrainerAgent: Customizable and Efficient Model Training Through LLM-Powered Multi-Agent System

EduAgent: Generative Student Agents in Learning

Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs

Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining

AgentBench: Evaluating LLMs as Agents

MetaAgents: Simulating Interactions of Human Behaviors for LLM-based Task-oriented Coordination via Collaborative Generative Agents

AgentTuning: Enabling Generalized Agent Abilities for LLMs

REInstruct: Building Instruction Data from Unlabeled Corpus

Large Language Model as a Policy Teacher for Training Reinforcement Learning Agents