RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion

Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, Yibo Zhu, Xin Jin
2024-09-26
Abstract: Reinforcement Learning from Human Feedback (RLHF) enhances the alignment between LLMs and human preference. The workflow of RLHF typically involves several models and tasks in a series of distinct stages. Existing RLHF training systems view each task as the smallest execution unit, thus overlooking the opportunities for subtask-level optimizations. Due to the intrinsic nature of RLHF training, i.e., the data skewness in the generation stage, and the pipeline bubbles in the training stage, existing RLHF systems suffer from low GPU utilization in production deployments. RLHFuse breaks the traditional view of RLHF workflow as a composition of individual tasks, splitting each task into finer-grained subtasks, and performing stage fusion to improve GPU utilization. RLHFuse contains two key ideas. First, for generation and inference tasks, RLHFuse splits them into sample-level subtasks, enabling efficient inter-stage fusion to mitigate the original generation bottleneck dominated by long-tailed samples. Second, for training tasks, RLHFuse breaks them into subtasks of micro-batches. By leveraging the intuition that pipeline execution can be essentially complemented by another pipeline, RLHFuse performs intra-stage fusion to concurrently execute these subtasks in the training stage with a fused pipeline schedule, resulting in fewer pipeline bubbles. In addition, RLHFuse incorporates a series of system optimizations tailored for each stage of RLHF, making it efficient and scalable for our internal product usage. We evaluate RLHFuse on various popular LLMs and the results show that RLHFuse increases the training throughput by up to 3.7x, compared to existing state-of-the-art systems.
Machine Learning; Computation and Language; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper targets low GPU utilization in the generation and training stages of existing Reinforcement Learning from Human Feedback (RLHF) training systems. Specifically:

1. **Data skew in the generation stage**: Response lengths follow a long-tailed distribution, so every batch contains a few samples that are much longer than the rest. Because the inference task depends on the generated data, it cannot start until the generation task completes. Even a handful of long-tailed samples therefore forces the inference task to wait, leaving GPU utilization in the generation stage extremely low.
2. **Pipeline bubbles in the training stage**: As the parameter counts of large language models (LLMs) grow, training needs a higher degree of pipeline parallelism to scale. The proportion of pipeline bubbles rises with the pipeline-parallel degree, which significantly reduces training efficiency. For example, under the common 1F1B pipeline schedule, once the LLM scales to tens of billions of parameters, about half of the GPUs are idle during training.

To solve these problems, the paper proposes **RLHFuse**, an efficient RLHF training framework that improves training throughput through two fusion techniques:

- **Inter-stage fusion**: Splits the generation and inference tasks into sample-level subtasks and dynamically migrates long-tailed samples to dedicated instances, so that inference can overlap with the generation of the long-tailed samples, increasing GPU utilization.
- **Intra-stage fusion**: Splits the training task into micro-batch subtasks and exploits the independence between the two models to run them under a fused pipeline schedule, which reduces pipeline bubbles and improves training efficiency.

RLHFuse also introduces a series of system optimizations, making it a production-grade framework that supports RLHF training for internal products. Experimental results show that RLHFuse increases training throughput by up to 3.7x over existing state-of-the-art systems.

In summary, the paper breaks the task-level view of the traditional RLHF workflow through fine-grained, subtask-level optimization, significantly increasing GPU utilization in the generation and training stages and improving overall training efficiency.
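
To make the two sources of idle GPU time concrete, here is a minimal back-of-the-envelope sketch. It is not RLHFuse's implementation or cost model. The function names are hypothetical, and both formulas rest on simplifying assumptions: the bubble estimate is the standard idealized one for a non-interleaved pipeline schedule such as 1F1B with equal-cost stages and micro-batches, and the generation estimate assumes each sample holds one decode slot, produces one token per step, and the batch is not refilled once short samples finish.

```python
from typing import Sequence


def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idealized idle fraction of a non-interleaved pipeline schedule.

    With p stages and m micro-batches of equal cost, each stage is busy for
    m time slots out of m + p - 1, so the idle share is (p - 1) / (m + p - 1).
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


def generation_idle_fraction(response_lengths: Sequence[int]) -> float:
    """Share of decode slots left idle while a batch waits for its longest sample.

    Assumption (toy model): every sample occupies one slot until the longest
    sample in the batch finishes, and finished slots are not reused.
    """
    if not response_lengths or max(response_lengths) <= 0:
        raise ValueError("response_lengths must contain a positive length")
    longest = max(response_lengths)
    total_slots = longest * len(response_lengths)
    return 1.0 - sum(response_lengths) / total_slots


if __name__ == "__main__":
    # Illustrative inputs only; they are not taken from the paper.
    print(f"bubble fraction: {pipeline_bubble_fraction(8, 8):.3f}")
    skewed_batch = [100] * 7 + [800]  # seven short responses, one long-tailed
    print(f"idle decode slots: {generation_idle_fraction(skewed_batch):.3f}")
```

Under these assumptions, 8 pipeline stages with 8 micro-batches give a bubble fraction of 7/15, close to the "about half" figure quoted above, and a single response eight times longer than its seven batch-mates leaves roughly three quarters of the decode slots idle. Both numbers shrink when other work fills the gaps, which is the intuition behind the two fusion techniques.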