Abstract: Reinforcement Learning from Human Feedback (RLHF) stands as a pivotal technique in empowering large language model (LLM) applications. Since RLHF involves diverse computational workloads and intricate dependencies among multiple LLMs, directly adopting parallelization techniques from supervised training can result in sub-optimal performance. To overcome this limitation, we propose a novel approach named parameter ReaLlocation, which dynamically redistributes LLM parameters in the cluster and adapts parallelization strategies during training. Building upon this idea, we introduce ReaLHF, a pioneering system capable of automatically discovering and running efficient execution plans for RLHF training given the desired algorithmic and hardware configurations. ReaLHF formulates the execution plan for RLHF as an augmented dataflow graph. Based on this formulation, ReaLHF employs a tailored search algorithm with a lightweight cost estimator to discover an efficient execution plan. Subsequently, the runtime engine deploys the selected plan by effectively parallelizing computations and redistributing parameters. We evaluate ReaLHF on the LLaMA-2 models with up to $4\times70$ billion parameters and 128 GPUs. The experiment results showcase ReaLHF's substantial speedups of $2.0-10.6\times$ compared to baselines. Furthermore, the execution plans generated by ReaLHF exhibit an average of $26\%$ performance improvement over heuristic approaches based on Megatron-LM. The source code of ReaLHF is publicly available at https://github.com/openpsi-project/ReaLHF.

What problem does this paper attempt to address?

The problem this paper addresses is how to improve training efficiency and resource utilization when training large language models (LLMs) with reinforcement learning from human feedback (RLHF). Existing RLHF systems directly adopt the parallelization techniques used in supervised training, which leads to two main problems:

1. **Over-parallelization**: When the system applies the same parallelization strategy on every GPU node, it incurs a large amount of synchronization and communication overhead, which reduces overall system performance.
2. **Insufficient resource utilization caused by asymmetric parallelization**: Different computing tasks require different parallelization strategies, but a fixed task allocation leaves some GPUs idle, so hardware resources are not fully used.

To solve these problems, the paper proposes parameter reallocation: dynamically adjusting the distribution of model parameters among GPUs during training. In this way, redundant communication can be eliminated and GPU utilization can be maximized, which significantly improves the efficiency of RLHF training.

### Main contributions

1. **A method for dynamically reallocating model parameters**: The distribution of model parameters among GPUs is adjusted during training to meet the needs of different computing tasks.
2. **A general formulation and an effective search algorithm**: These are used to discover efficient RLHF execution plans.
3. **The design and implementation of the ReaLHF system**: The system can automatically discover and run fast execution plans with high throughput.
4. **A comprehensive experimental evaluation**: ReaLHF shows a significant performance improvement over the baseline systems, with speedups of 2.0× to 10.6× and, in specific cases, a performance improvement of 80%.
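To make the parameter reallocation idea above more concrete, here is a minimal sketch in plain Python. It is not ReaLHF's implementation. It assumes a toy setting in which a parameter is a flat list sharded evenly and contiguously across GPUs, and the names `shard_ranges` and `reallocate` are hypothetical. It shows how moving from one sharding layout to another determines which slices must be transferred between devices.

```python
from typing import Dict, List, Tuple


def shard_ranges(num_elems: int, gpus: List[int]) -> Dict[int, Tuple[int, int]]:
    """Split [0, num_elems) evenly and contiguously across the given GPU ids."""
    assert num_elems % len(gpus) == 0, "toy example assumes even division"
    size = num_elems // len(gpus)
    return {g: (i * size, (i + 1) * size) for i, g in enumerate(gpus)}


def reallocate(
    src_shards: Dict[int, list], src_gpus: List[int], dst_gpus: List[int]
) -> Tuple[Dict[int, list], List[Tuple[int, int, int]]]:
    """Re-shard a parameter from src_gpus to dst_gpus (hypothetical helper).

    Returns the new shards and a list of (src_gpu, dst_gpu, num_elems)
    transfers; slices that already live on the destination GPU are not
    counted as transfers.
    """
    num_elems = sum(len(src_shards[g]) for g in src_gpus)
    src = shard_ranges(num_elems, src_gpus)
    dst = shard_ranges(num_elems, dst_gpus)
    dst_shards: Dict[int, list] = {}
    transfers: List[Tuple[int, int, int]] = []
    for d in dst_gpus:
        d_start, d_end = dst[d]
        buf: list = []
        for s in src_gpus:  # source ranges are visited in ascending order
            s_start, s_end = src[s]
            lo, hi = max(d_start, s_start), min(d_end, s_end)
            if lo < hi:  # the source shard overlaps the destination shard
                buf.extend(src_shards[s][lo - s_start : hi - s_start])
                if s != d:
                    transfers.append((s, d, hi - lo))
        dst_shards[d] = buf
    return dst_shards, transfers


# Example: a parameter sharded over 4 GPUs is moved to a 2-GPU layout.
weights = list(range(8))
four_way = {g: weights[2 * g : 2 * g + 2] for g in range(4)}
two_way, moves = reallocate(four_way, [0, 1, 2, 3], [0, 1])
```

In this toy case GPU 0 keeps the slice it already holds and receives one slice from GPU 1, while GPU 1 receives its entire new shard from GPUs 2 and 3. A real system would issue these transfers as collective or point-to-point communication and would also handle pipeline and data parallelism, which this sketch omits.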
### Technical details

The ReaLHF system consists of two parts:

- **Execution plan generator**: Searches the space of execution plans with a Markov Chain Monte Carlo (MCMC) algorithm, combined with a lightweight cost estimator, to find an efficient execution plan.
- **Runtime engine**: Deploys the generated execution plan by effectively parallelizing the computation and reallocating the model parameters.

Through these innovations, ReaLHF can achieve higher efficiency and better resource utilization in the RLHF training of large language models.
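The search performed by the execution plan generator described above can be illustrated with a small Metropolis-style MCMC loop. This is a sketch under stated assumptions, not ReaLHF's algorithm or cost model. The function-call names, the candidate tensor-parallel degrees, and every constant in `estimate_cost` are invented for illustration, and a plan here is simply a mapping from each call to one parallel degree.

```python
import math
import random

# Hypothetical RLHF function calls and candidate tensor-parallel degrees.
CALLS = ["actor_gen", "reward_inf", "ref_inf", "critic_inf",
         "actor_train", "critic_train"]
TP_OPTIONS = [1, 2, 4, 8]

# Made-up workload sizes and overheads for the toy cost estimator.
WORK = {"actor_gen": 40.0, "reward_inf": 6.0, "ref_inf": 6.0,
        "critic_inf": 6.0, "actor_train": 24.0, "critic_train": 24.0}
COMM_PER_DEGREE = 1.0   # assumed communication cost per extra parallel degree
REALLOC_COST = 0.5      # assumed cost of changing a model's layout between calls
SAME_MODEL = [("actor_gen", "actor_train"), ("critic_inf", "critic_train")]


def estimate_cost(plan):
    """Toy lightweight cost estimator: compute + communication + reallocation."""
    cost = sum(WORK[c] / plan[c] + COMM_PER_DEGREE * (plan[c] - 1) for c in CALLS)
    cost += sum(REALLOC_COST for a, b in SAME_MODEL if plan[a] != plan[b])
    return cost


def mcmc_search(steps=2000, beta=1.0, seed=0):
    """Metropolis sampling over plans; returns the best plan seen and its cost."""
    rng = random.Random(seed)
    plan = {c: 1 for c in CALLS}
    cost = estimate_cost(plan)
    best_plan, best_cost = dict(plan), cost
    for _ in range(steps):
        # Symmetric proposal: re-draw the parallel degree of one random call.
        cand = dict(plan)
        cand[rng.choice(CALLS)] = rng.choice(TP_OPTIONS)
        cand_cost = estimate_cost(cand)
        # Always accept improvements; accept regressions with Boltzmann probability.
        if cand_cost <= cost or rng.random() < math.exp(-beta * (cand_cost - cost)):
            plan, cost = cand, cand_cost
            if cost < best_cost:
                best_plan, best_cost = dict(plan), cost
    return best_plan, best_cost


best_plan, best_cost = mcmc_search()
```

Occasionally accepting a worse plan lets the chain escape local minima, and because each candidate is scored by a cheap estimator rather than a profiled run, many candidates can be evaluated quickly. The search returns the best plan it has seen, which is not guaranteed to be the global optimum.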

ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training

HybridFlow: A Flexible and Efficient RLHF Framework

RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion

Parameter Efficient Reinforcement Learning from Human Feedback

MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

RLHF Workflow: From Reward Modeling to Online RLHF

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

RRHF: Rank Responses to Align Language Models with Human Feedback

Does RLHF Scale? Exploring the Impacts From Data, Model, and Method

ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models

Secrets of RLHF in Large Language Models Part I: PPO

RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

ALaRM: Align Language Models via Hierarchical Rewards Modeling

Prototypical Reward Network for Data-Efficient RLHF

Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with Proxy

Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging

Mitigating the Alignment Tax of RLHF