Octopus: an End-to-end Multi-DAG Scheduling Method Based on Deep Reinforcement Learning

Yi Chang, Haosong Peng, Yufeng Zhan, Yuanqing Xia
DOI: https://doi.org/10.23919/ccc63176.2024.10662729
2024-01-01
Abstract: With the rapid growth of cloud computing, more and more vendors are deploying their services to the cloud. Efficient job scheduling is essential for enhancing system operation performance. These services, represented as Directed Acyclic Graphs (DAGs), usually have intricate dependencies. Existing research has limitations in solving the multi-DAG job scheduling problem and often overlooks end-to-end scheduling directly from tasks to servers. For example, scheduling each job individually without considering the overall information of all jobs might lead to an extended total completion time. To address these issues, this paper proposes Octopus, an intelligent end-to-end multi-DAG job scheduling algorithm based on deep reinforcement learning. Octopus is designed to address the challenges of dynamic and large input dimensions in the multi-DAG scheduling problem. A graph neural network feature extraction module is designed to extract the topological structure of multi-DAG jobs. An improved kernel-based network is then used to handle dynamic inputs. Simulation experiments conducted on different scales of DAG jobs and servers demonstrate that the approach can reduce the overall completion time of multi-DAG jobs by up to 30% compared with traditional scheduling methods.
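The abstract's example, that scheduling each job individually can extend the total completion time, can be illustrated with a small sketch. This is not the Octopus algorithm: it is a simple greedy earliest-finish heuristic on identical servers, and the function name `greedy_makespan`, the job format, and the toy durations are all assumptions made for illustration. "Scheduling each job individually" is modeled here, as a simplification, by running the jobs one after another.

```python
def greedy_makespan(jobs, num_servers):
    """Greedy earliest-finish scheduling of several DAG jobs (illustrative only).

    jobs: {job_id: (durations, edges)}, where durations maps task -> run time
    and edges is a list of (parent, child) pairs. Servers are assumed identical.
    Returns the overall completion time (makespan).
    """
    duration, parents = {}, {}
    for job_id, (durations, edges) in jobs.items():
        for task, d in durations.items():
            duration[(job_id, task)] = d
            parents[(job_id, task)] = []
        for parent, child in edges:
            parents[(job_id, child)].append((job_id, parent))

    server_free = [0] * num_servers  # time at which each server becomes idle
    finish = {}                      # task -> finish time, once scheduled
    while len(finish) < len(duration):
        best = None
        for t in sorted(duration):
            # Skip tasks already scheduled or with unscheduled parents.
            if t in finish or any(p not in finish for p in parents[t]):
                continue
            ready_at = max((finish[p] for p in parents[t]), default=0)
            for s in range(num_servers):
                end = max(ready_at, server_free[s]) + duration[t]
                if best is None or end < best[0]:
                    best = (end, t, s)
        end, t, s = best  # assumes the input graphs are acyclic
        finish[t] = end
        server_free[s] = end
    return max(finish.values())


# Hypothetical toy workload: two chain-shaped jobs, two servers.
jobs = {
    "A": ({"a1": 3, "a2": 2}, [("a1", "a2")]),
    "B": ({"b1": 2, "b2": 2}, [("b1", "b2")]),
}
joint = greedy_makespan(jobs, 2)
one_at_a_time = sum(greedy_makespan({j: spec}, 2) for j, spec in jobs.items())
print(joint, one_at_a_time)
```

On this toy input the joint schedule finishes at time 5, while running the two jobs back to back takes 9, because each chain leaves one server idle when scheduled alone.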