Abstract: Reinforcement Learning from Human Feedback (RLHF) enhances the alignment between LLMs and human preference. The workflow of RLHF typically involves several models and tasks in a series of distinct stages. Existing RLHF training systems view each task as the smallest execution unit, thus overlooking the opportunities for subtask-level optimizations. Due to the intrinsic nature of RLHF training, i.e., the data skewness in the generation stage and the pipeline bubbles in the training stage, existing RLHF systems suffer from low GPU utilization in production deployments. RLHFuse breaks the traditional view of RLHF workflow as a composition of individual tasks, splitting each task into finer-grained subtasks, and performing stage fusion to improve GPU utilization. RLHFuse contains two key ideas. First, for generation and inference tasks, RLHFuse splits them into sample-level subtasks, enabling efficient inter-stage fusion to mitigate the original generation bottleneck dominated by long-tailed samples. Second, for training tasks, RLHFuse breaks them into subtasks of micro-batches. By leveraging the intuition that pipeline execution can be essentially complemented by another pipeline, RLHFuse performs intra-stage fusion to concurrently execute these subtasks in the training stage with a fused pipeline schedule, resulting in fewer pipeline bubbles. In addition, RLHFuse incorporates a series of system optimizations tailored for each stage of RLHF, making it efficient and scalable for our internal product usage. We evaluate RLHFuse on various popular LLMs and the results show that RLHFuse increases the training throughput by up to 3.7x, compared to existing state-of-the-art systems.

What problem does this paper attempt to address?

The paper addresses low GPU utilization in the generation and training stages of existing Reinforcement Learning from Human Feedback (RLHF) training frameworks. Specifically:

1. **Data skew in the generation stage**: The response lengths of generated samples follow a long-tailed distribution, so each batch contains some samples that are significantly longer than the others. Because of the data dependency between stages, the inference task cannot start until the generation task completes. Even a few long-tailed samples therefore force the inference task to wait, leaving GPU utilization in the generation stage extremely low.
2. **Pipeline bubbles in the training stage**: With the exponential growth in the parameter counts of large language models (LLMs), a higher degree of pipeline parallelism is required to scale training. However, the proportion of pipeline bubbles grows as the degree of pipeline parallelism increases, which significantly reduces training efficiency. For example, under the common 1F1B pipeline schedule, when the LLM scales to tens of billions of parameters, about half of the GPUs are idle during training, wasting resources.

To solve these problems, the paper proposes **RLHFuse**, an efficient RLHF training framework that improves training throughput through two fusion techniques:

- **Inter-stage fusion**: Splits the generation and inference tasks into sample-level subtasks and dynamically migrates long-tailed samples to dedicated instances, so that the inference task can overlap with the generation of long-tailed samples, increasing GPU utilization.
- **Intra-stage fusion**: Splits the training task into micro-batch subtasks and exploits the fact that the two models being trained are independent of each other, reducing pipeline bubbles through a fused pipeline schedule and improving training efficiency.

In addition, RLHFuse introduces a series of system optimizations that make it a production-grade framework supporting RLHF training for internal products. Experimental results show that RLHFuse increases training throughput by up to 3.7x compared with existing state-of-the-art systems.

In summary, the paper aims to break the task-level view of the traditional RLHF workflow through fine-grained, subtask-level optimization, thereby significantly increasing GPU utilization in the generation and training stages and improving overall training efficiency.
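
The two inefficiencies named above can be illustrated with a back-of-the-envelope sketch in Python. It is not taken from the paper: `bubble_fraction` uses the standard idle-time estimate for a synchronous pipeline with equal-cost micro-batches, and `generation_idle_fraction` is a hypothetical toy model in which a whole batch is held until its longest response finishes. Both function names and the sample numbers are assumptions chosen for illustration.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle share of one iteration in a synchronous 1F1B-style pipeline.

    Standard estimate (assumes uniform stages and equal-cost micro-batches):
    an iteration takes (m + p - 1) time slots, of which (p - 1) are bubbles.
    """
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)


def generation_idle_fraction(response_lengths: list[int]) -> float:
    """Share of decode slots wasted while a batch waits for its longest sample.

    Toy model (assumption): every sample holds one slot per decoding step
    until the longest sample in the batch has finished.
    """
    longest = max(response_lengths)
    used_slots = sum(response_lengths)
    total_slots = longest * len(response_lengths)
    return 1 - used_slots / total_slots


if __name__ == "__main__":
    # With as many micro-batches as stages, close to half the time is bubble.
    print(f"bubble fraction: {bubble_fraction(16, 16):.3f}")

    # Seven short responses and one long-tailed response (illustrative lengths).
    lengths = [100] * 7 + [800]
    print(f"generation idle fraction: {generation_idle_fraction(lengths):.3f}")
```

Under these assumptions, a pipeline whose micro-batch count is close to its stage count idles for nearly half of each iteration, which matches the "about half of the GPUs are idle" figure in the summary, and a single long-tailed sample is enough to leave most decode slots empty in the toy batch.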

RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion

ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation

HybridFlow: A Flexible and Efficient RLHF Framework

An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Does RLHF Scale? Exploring the Impacts From Data, Model, and Method

Cool-Fusion: Fuse Large Language Models without Training

RRHF: Rank Responses to Align Language Models with Human Feedback

RLHF Workflow: From Reward Modeling to Online RLHF

Prototypical Reward Network for Data-Efficient RLHF

MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

An Enhanced-State Reinforcement Learning Algorithm for Multi-Task Fusion in Large-Scale Recommender Systems

The Perfect Blend: Redefining RLHF with Mixture of Judges

ProFuser: Progressive Fusion of Large Language Models

R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback

FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment

Understanding and Alleviating Memory Consumption in RLHF for LLMs

FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression

Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models

FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs

DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models