Adaptive Asynchronous Work-Stealing for distributed load-balancing in heterogeneous systems

João B. Fernandes, Ítalo A. S. de Assis, Idalmis M. S. Martins, Tiago Barros, Samuel Xavier-de-Souza
2024-01-24
Abstract: Supercomputers have revolutionized how industries and scientific fields process large amounts of data. These machines group hundreds or thousands of computing nodes working together to execute time-consuming programs that require a large amount of computational resources. Over the years, supercomputers have expanded to include new and different technologies characterizing them as heterogeneous. However, executing a program in a heterogeneous environment requires attention to a specific aspect of performance degradation: load imbalance. In this research, we address the challenges associated with load imbalance when scheduling many homogeneous tasks in a heterogeneous environment. To address this issue, we introduce the concept of adaptive asynchronous work-stealing. This approach collects information about the nodes and utilizes it to improve work-stealing aspects, such as victim selection and task offloading. Additionally, the proposed approach eliminates the need for extra threads to communicate information, thereby reducing overhead when implementing a fully asynchronous approach. Our experimental results demonstrate a performance improvement of approximately 10.1% compared to other conventional and state-of-the-art implementations.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper primarily addresses the issue of load imbalance when executing distributed applications in a heterogeneous supercomputing environment. Specifically, the research team proposed the Adaptive Asynchronous Work-Stealing (A2WS) algorithm to optimize load balancing, reduce communication overhead, and improve the scalability of programs across various types of heterogeneous nodes. Below is a summary of the main problems addressed by the paper and the proposed solutions:

1. **Load Imbalance Issue**:
   - In a heterogeneous environment, due to the varying performance of different computing nodes, load imbalance can easily occur. This means some nodes complete their tasks early and wait for other slower nodes to finish, leading to resource wastage.
2. **Solutions**:
   - **Adaptive Asynchronous Work-Stealing (A2WS)**: This is a new scheduling method aimed at minimizing communication overhead and achieving efficient scalability of programs across various types of heterogeneous nodes.
   - **Limited Information Propagation**: Global information propagation is conducted through local interactions and one-sided MPI communication, reducing unnecessary information exchange and lowering communication overhead.
   - **Intelligent Stealing**: The collected information is used to adjust the number of tasks to be stolen, avoiding the network burden caused by frequent stealing and improving load balancing efficiency.
   - **Asynchronous Stealing Mechanism**: A fully asynchronous task stealing process is implemented, enhancing the flexibility and efficiency of task redistribution.
   - **Predictive Stealing**: Allows task stealing to begin immediately after the first task is completed, avoiding empty queues and thus reducing idle time.
   - **Victim Selection**: An optimized method is proposed to select appropriate victim nodes, reducing the number of steals and improving overall performance.

Through these methods, A2WS balances the load in a heterogeneous environment and improves the utilization of computing resources. Experimental results show a performance improvement of approximately 10.1% compared to conventional and state-of-the-art implementations.
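
To make the victim-selection and steal-size ideas from the summary more concrete, here is a minimal Python sketch. It is not the A2WS algorithm from the paper: the `NodeInfo` record, the "longest estimated finish time" victim rule, and the "equalize finish times" steal formula are all illustrative assumptions chosen to show how collected per-node information could drive both decisions. It also ignores communication entirely, whereas the paper relies on one-sided MPI.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeInfo:
    """Hypothetical snapshot of what a thief currently knows about one node."""
    rank: int
    speed: float  # estimated tasks completed per second (assumed > 0)
    queued: int   # estimated number of tasks still waiting in its queue


def estimated_finish_time(node: NodeInfo) -> float:
    """Seconds the node would need to drain its queue at its current speed."""
    return node.queued / node.speed


def select_victim(thief: NodeInfo, known: List[NodeInfo]) -> Optional[NodeInfo]:
    """Pick the node expected to finish last, if it would finish after the thief.

    Assumed heuristic: stealing from the node with the longest estimated
    finish time removes the most imbalance per steal.
    """
    candidates = [n for n in known if n.rank != thief.rank and n.queued > 0]
    if not candidates:
        return None
    victim = max(candidates, key=estimated_finish_time)
    if estimated_finish_time(victim) <= estimated_finish_time(thief):
        return None  # nothing to gain from stealing
    return victim


def steal_amount(thief: NodeInfo, victim: NodeInfo) -> int:
    """Number of tasks k that would make both nodes finish at the same time.

    Solves (victim.queued - k) / victim.speed == (thief.queued + k) / thief.speed
    for k, then clamps the result to what the victim actually holds.
    """
    k = (victim.queued * thief.speed - thief.queued * victim.speed) / (
        thief.speed + victim.speed
    )
    return max(0, min(victim.queued, int(k)))


if __name__ == "__main__":
    # Illustrative numbers only: a fast idle node and two slower, loaded nodes.
    thief = NodeInfo(rank=0, speed=4.0, queued=0)
    nodes = [thief, NodeInfo(1, 1.0, 100), NodeInfo(2, 2.0, 120)]
    victim = select_victim(thief, nodes)
    if victim is not None:
        print(victim.rank, steal_amount(thief, victim))
```

Because the steal formula also accepts a thief whose queue is not yet empty, the same sketch covers the predictive case in which a node starts stealing before it runs out of work. Taking many tasks in one steal, rather than one at a time, reflects the summary's point about avoiding the network burden of frequent steals.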