DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents

Taiyi Wang, Zhihao Wu, Jianheng Liu, Jianye Hao, Jun Wang, Kun Shao
2024-10-25
Abstract: On-device control agents, especially on mobile devices, are responsible for operating mobile devices to fulfill users' requests, enabling seamless and intuitive interactions. Integrating Multimodal Large Language Models (MLLMs) into these agents enhances their ability to understand and execute complex commands, thereby improving user experience. However, fine-tuning MLLMs for on-device control presents significant challenges due to limited data availability and inefficient online training processes. This paper introduces DistRL, a novel framework designed to enhance the efficiency of online RL fine-tuning for mobile device control agents. DistRL employs centralized training and decentralized data acquisition to ensure efficient fine-tuning in the context of dynamic online interactions. Additionally, the framework is backed by our tailor-made RL algorithm, which effectively balances exploration with the prioritized utilization of collected data to ensure stable and robust training. Our experiments show that, on average, DistRL delivers a 3X improvement in training efficiency and enables training data collection 2.4X faster than the leading synchronous multi-machine methods. Notably, after training, DistRL achieves a 20% relative improvement in success rate compared to state-of-the-art methods on general Android tasks from an open benchmark, significantly outperforming existing approaches while maintaining the same training time. These results validate DistRL as a scalable and efficient solution, offering substantial improvements in both training efficiency and agent performance for real-world, in-the-wild device control tasks.
Machine Learning; Artificial Intelligence; Distributed, Parallel, and Cluster Computing; Systems and Control
What problem does this paper attempt to address?
The main problem this paper addresses is how to fine-tune multimodal large language models (MLLMs) for mobile device control efficiently and reliably with online reinforcement learning (RL). Specifically, the paper addresses the following key challenges:

1. **Limited data availability and inefficient online training**:
   - Limited data acquisition and inefficient online training make it difficult to fine-tune MLLMs effectively for control tasks on mobile devices.
   - Existing offline datasets cannot capture the dynamic changes of mobile applications and environments, so models trained on them perform poorly in actual deployment.
2. **Complexity of distributed asynchronous data collection and training**:
   - Asynchronous data collection introduces algorithmic difficulties: non-stationary data distributions hinder convergence, and the delay between policy updates and data collection may degrade performance.
   - In a distributed environment, devices collect data at different rates and times, which makes consistency and stability harder to maintain.
3. **Limitations of existing methods**:
   - Previous work relied on complex wrappers or training on static data and could not adapt to the ever-changing real-world environment.
   - Even the most advanced MLLMs (such as GPT-4V) have limitations when handling GUI control tasks, especially in error recovery and behavioral rationality.

To solve these problems, the paper proposes DistRL, a new distributed RL fine-tuning framework that aims to improve the performance of mobile device control agents in the following ways:

- **Asynchronous distributed architecture**: Training is centralized while data collection is decentralized, which keeps online fine-tuning efficient.
- **Custom-designed RL algorithm**: A new off-policy RL algorithm, A-RIDE, balances exploration and exploitation and prioritizes valuable experience data to keep training stable and efficient.
- **Distributed prioritized experience replay (DPER)**: Prioritizing important trajectories in the replay buffer improves sample utilization and accelerates convergence.

Experimental results show that, compared with existing synchronous multi-machine methods, DistRL improves training efficiency by 3 times and data collection speed by 2.4 times, and achieves a 20% relative improvement in success rate on general Android tasks. This supports the scalability and efficiency of DistRL in real-world device control tasks.

In summary, the DistRL framework addresses the key challenges of online RL fine-tuning for mobile device control and significantly improves both training efficiency and agent performance.
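
As a rough illustration of two ideas summarized above (decentralized, asynchronous data collection feeding a centralized learner, and priority-weighted replay of trajectories), here is a minimal Python sketch. It is not the paper's implementation: `PrioritizedReplayBuffer`, `collect`, and `run` are hypothetical names, the random `reward` stands in for a real task-success signal, using that reward as the priority is an assumption made only for this example, and evicting the lowest-priority trajectory is just one possible policy. The sketch also omits the step in which a learner would send updated policy weights back to the workers.

```python
import queue
import random
import threading


class PrioritizedReplayBuffer:
    """Toy buffer that samples trajectories in proportion to a priority score."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []  # list of (priority, trajectory) pairs
        self.rng = random.Random(seed)

    def add(self, trajectory, priority):
        # Clamp to a small positive value so every weight is usable for sampling.
        self.items.append((max(priority, 1e-6), trajectory))
        if len(self.items) > self.capacity:
            # Illustrative eviction policy: drop the lowest-priority trajectory.
            idx = min(range(len(self.items)), key=lambda i: self.items[i][0])
            del self.items[idx]

    def sample(self, k):
        # Sample with replacement, weighted by priority.
        weights = [p for p, _ in self.items]
        picked = self.rng.choices(self.items, weights=weights, k=k)
        return [trajectory for _, trajectory in picked]

    def __len__(self):
        return len(self.items)


def collect(worker_id, episodes, out_queue):
    """Stand-in for one device or emulator rolling out the current policy."""
    rng = random.Random(worker_id)
    for episode in range(episodes):
        reward = rng.random()  # placeholder for a task-success signal
        out_queue.put({"worker": worker_id, "episode": episode, "reward": reward})
    out_queue.put(None)  # sentinel: this worker has finished


def run(num_workers=3, episodes=4, batch_size=2, capacity=8):
    """Central learner loop: consume trajectories as they arrive, in any order."""
    traj_queue = queue.Queue()  # unbounded, so workers never wait on the learner
    buffer = PrioritizedReplayBuffer(capacity)
    workers = [
        threading.Thread(target=collect, args=(i, episodes, traj_queue))
        for i in range(num_workers)
    ]
    for worker in workers:
        worker.start()

    finished, updates = 0, 0
    while finished < num_workers:
        item = traj_queue.get()
        if item is None:
            finished += 1
            continue
        buffer.add(item, priority=item["reward"])
        if len(buffer) >= batch_size:
            batch = buffer.sample(batch_size)
            assert len(batch) == batch_size
            updates += 1  # a real learner would take a gradient step on `batch`

    for worker in workers:
        worker.join()
    return len(buffer), updates


if __name__ == "__main__":
    print(run())
```

The learner never waits for all workers to finish a round: it processes whichever trajectory arrives next, which is the property that distinguishes this layout from a synchronous collection loop.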