A Distributed-GPU Deep Reinforcement Learning System for Solving Large Graph Optimization Problems
Weijian Zheng, Dali Wang, Fengguang Song
DOI: https://doi.org/10.1145/3589188
2023-03-23
ACM Transactions on Parallel Computing
Abstract: Graph optimization problems (such as minimum vertex cover, maximum cut, and the travelling salesman problem) appear in many fields, including social sciences, power systems, chemistry, and bioinformatics. Recently, deep reinforcement learning (DRL) has shown success in automatically learning good heuristics for solving graph optimization problems. However, existing RL systems either do not support graph RL environments or do not support multiple GPUs in a distributed setting. Without parallelization and scalability, reinforcement learning is limited in its ability to solve large-scale graph optimization problems. To address these challenges, we develop RL4GO, a high-performance distributed-GPU DRL framework for solving graph optimization problems. RL4GO focuses on a class of computationally demanding RL problems in which both the RL environment and the policy model are highly computation intensive. Traditional reinforcement learning systems often assume that either the RL environment has low time complexity or the policy model is small. In this work, we distribute large-scale graphs across distributed GPUs and use spatial parallelism and data parallelism to achieve scalable performance. We compare and analyze the performance of spatial parallelism and data parallelism, and show how they differ. To support graph neural network (GNN) layers whose input data samples are partitioned across distributed GPUs, we design parallel mathematical kernels that operate on distributed 3D sparse and 3D dense tensors. To handle costly RL environments, we design a parallel graph environment that scales up all RL-environment-related operations. By combining the scalable GNN layers with the scalable RL environment, we develop high-performance parallel RL4GO training and inference algorithms.
Furthermore, we propose two optimization techniques (replay buffer on-the-fly graph generation and adaptive multiple-node selection) to minimize the space cost and accelerate reinforcement learning. This work also conducts in-depth analyses of parallel efficiency and memory cost, and shows that the designed RL4GO algorithms are scalable on numerous distributed GPUs. Evaluations on large-scale graphs show that (1) RL4GO training and inference can achieve good parallel efficiency on 192 GPUs; (2) its training can be 18 times faster than with the state-of-the-art Gorila distributed RL framework [34]; and (3) its inference performance achieves a 26 times improvement over Gorila.
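
To make the idea of selecting multiple nodes per environment step concrete, the following is a minimal single-process sketch for minimum vertex cover. It is an illustration under stated assumptions, not RL4GO's implementation: the names `VertexCoverEnv`, `select_nodes`, and `solve` are hypothetical, the residual node degree stands in for the scores a learned policy would produce, and the number of nodes picked per step is a fixed `k` rather than the adaptive rule the abstract refers to.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


class VertexCoverEnv:
    """Toy minimum-vertex-cover environment (hypothetical, not RL4GO's API)."""

    def __init__(self, num_nodes: int, edges: List[Edge]) -> None:
        self.num_nodes = num_nodes
        self.edges = list(edges)
        self.reset()

    def reset(self) -> None:
        """Start a new episode with an empty cover."""
        self.cover: Set[int] = set()
        self.uncovered: Set[Edge] = set(self.edges)

    def residual_degrees(self) -> Dict[int, int]:
        """Count uncovered edges incident to each node not yet in the cover."""
        deg = {v: 0 for v in range(self.num_nodes) if v not in self.cover}
        for u, v in self.uncovered:
            deg[u] += 1
            deg[v] += 1
        return deg

    def step(self, nodes: List[int]) -> Tuple[int, bool]:
        """Add nodes to the cover; the reward is -1 per newly added node."""
        added = [v for v in nodes if v not in self.cover]
        self.cover.update(added)
        self.uncovered = {
            (u, v)
            for (u, v) in self.uncovered
            if u not in self.cover and v not in self.cover
        }
        return -len(added), not self.uncovered


def select_nodes(scores: Dict[int, int], k: int) -> List[int]:
    """Pick up to k nodes with the highest positive score (ties: lower id)."""
    candidates = [v for v in scores if scores[v] > 0]
    candidates.sort(key=lambda v: (-scores[v], v))
    return candidates[:k]


def solve(env: VertexCoverEnv, k: int) -> Tuple[Set[int], int]:
    """Run one episode, selecting up to k nodes per environment step."""
    if k < 1:
        raise ValueError("k must be at least 1")
    env.reset()
    done = not env.uncovered
    steps = 0
    while not done:
        # Assumption: degree is a stand-in for a learned policy's node scores.
        _, done = env.step(select_nodes(env.residual_degrees(), k))
        steps += 1
    return env.cover, steps


if __name__ == "__main__":
    path = VertexCoverEnv(4, [(0, 1), (1, 2), (2, 3)])
    print(solve(path, k=1))
    print(solve(path, k=2))
```

On the four-node path graph in the usage example, both settings reach the same cover, but `k=2` needs one environment step instead of two. Fewer steps means fewer policy evaluations per episode; the trade-off in this sketch is that nodes chosen in the same step cannot take each other into account, so a larger `k` can add redundant nodes on other graphs.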