Abstract: The development of Large Language Models (LLMs) often confronts challenges stemming from the heavy reliance on human annotators in the reinforcement learning with human feedback (RLHF) framework, or the frequent and costly external queries tied to the self-instruct paradigm. In this work, we pivot to Reinforcement Learning (RL) -- but with a twist. Diverging from the typical RLHF, which refines LLMs following instruction data training, we use RL to directly generate the foundational instruction dataset that alone suffices for fine-tuning. Our method, TeaMs-RL, uses a suite of textual operations and rules, prioritizing the diversification of training datasets. It facilitates the generation of high-quality data without excessive reliance on external advanced models, paving the way for a single fine-tuning step and negating the need for subsequent RLHF stages. Our findings highlight key advantages of our approach: reduced need for human involvement and fewer model queries (only $5.73\%$ of the strong baseline's total), along with enhanced capabilities of LLMs in crafting and comprehending complex instructions compared to strong baselines, and substantially improved model privacy protection. Code is available at https://github.com/SafeRL-Lab/TeaMs-RL

What problem does this paper attempt to address?

This paper aims to reduce the dependence on human annotators when training large language models (LLMs) and to cut the frequent, expensive external queries associated with the self-instruct paradigm. Specifically, the authors propose TeaMs-RL, a method that uses reinforcement learning (RL) to directly generate the foundational instruction dataset used for fine-tuning, rather than applying RL to refine an LLM after instruction-data training. The method aims to reduce human involvement, reduce the number of model queries, improve the ability of LLMs to handle complex instructions, and enhance model privacy protection.

### Main Contributions

1. **Reduced human involvement**: TeaMs-RL significantly reduces the need for human annotators, which lowers cost.
2. **Fewer model queries**: TeaMs-RL requires only 5.73% of the queries made by the strong baseline.
3. **Better instruction quality and model performance**: The generated instruction dataset is more diverse and of high quality, which improves the ability of LLMs to handle complex tasks.
4. **Stronger model privacy protection**: Reducing the dependence on external data reduces the risk of data leakage.

### Method Overview

1. **Training the instruction generator**:
   - Train an instruction generator (the instructor LLM) using a continuous action-space encoding, with a diversity rule as the reward function.
   - Feed initial instructions into the generator, which produces more complex instructions.
   - Adjust the generator's policy according to the measured diversity of the generated instructions.
2. **Generating the instruction-response dataset**:
   - Use the trained instruction generator together with expert LLMs (such as ChatGPT) to produce high-quality instructions and the corresponding responses.
   - These instructions and responses form a high-quality dataset for fine-tuning the pre-aligned LLM.
3. **Fine-tuning the pre-aligned LLM**:
   - Perform supervised fine-tuning (SFT) of the pre-aligned LLM on the generated instruction-response dataset, yielding an LLM that can handle complex instructions.

### Experimental Results

- **Enhanced instruction diversity**: TeaMs-RL significantly improves the diversity of the instruction dataset.
- **Performance improvement**: On the ARC and HellaSwag benchmarks, the TeaMs-RL model outperforms WizardLM-7b.
- **Cost-effectiveness**: The dataset used by TeaMs-RL is only about one-fourteenth the size of the WizardLM dataset, and the number of queries is greatly reduced, which significantly lowers training cost.
- **Privacy protection**: In the model privacy attack experiment, the TeaMs-RL model shows stronger privacy protection: the attack's ROC curve is close to random guessing, with an AUC of 0.47 (an AUC of 0.5 corresponds to chance).

### Conclusion

TeaMs-RL generates high-quality instruction datasets through reinforcement learning. It reduces the dependence on human annotators and external model queries, improves the performance and privacy protection of LLMs, and offers a more economical and sustainable way to train them.
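To make the idea of a diversity-based reward concrete, here is a minimal Python sketch. It is not the authors' reward function: the summary only says that a diversity rule serves as the reward, so the mean Jaccard distance used here is a stand-in chosen for simplicity, and the names `tokens`, `jaccard_distance`, and `diversity_reward` are hypothetical.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercase word tokens of an instruction (a deliberately crude tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def diversity_reward(candidate: str, pool: list[str]) -> float:
    """Mean Jaccard distance from `candidate` to each instruction in `pool`.

    Assumption for illustration only: a candidate that shares few words
    with the existing pool is treated as more diverse and scores higher.
    An empty pool yields the maximum reward of 1.0.
    """
    if not pool:
        return 1.0
    cand = tokens(candidate)
    return sum(jaccard_distance(cand, tokens(p)) for p in pool) / len(pool)


# Hypothetical instruction pool and two candidate instructions.
pool = [
    "Write a function that sorts a list.",
    "Explain how photosynthesis works.",
]
near_duplicate = "Write a function that sorts a list of numbers."
novel = "Draft a polite email declining a meeting invitation."

print(diversity_reward(near_duplicate, pool))
print(diversity_reward(novel, pool))
```

Under this stand-in metric, the near-duplicate instruction scores lower than the novel one, which is the kind of signal an RL policy could use to favor instructions that broaden the dataset. A practical system would more likely measure diversity with embeddings or a learned scorer than with word overlap.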

TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via Reinforcement Learning