Abstract: Energy consumption and performance have become critical concerns when scheduling parallel task-based applications in high-performance computing systems such as cloud datacenters. Duplication and clustering strategies and the Dynamic Voltage and Frequency Scaling (DVFS) technique have each been applied separately to reduce energy consumption and to optimize performance parameters such as throughput and makespan. This paper proposes EATSDCD, a dual-phase, energy-efficient, time-aware scheduling algorithm. The algorithm combines duplication and clustering strategies to schedule a precedence-constrained task graph on DVFS-enabled datacenter processors. The first phase uses a smart combination of duplication and clustering to reduce the makespan and the energy consumed by processors when executing a Directed Acyclic Graph (DAG), while satisfying the throughput constraint. The second phase carries the main idea behind EATSDCD, which is to minimize energy consumption. After the critical path is determined and the sets of dependent tasks on non-critical paths are identified, the slack time for each task on a non-critical path is distributed among all dependent tasks on that path. Then, the frequency of the DVFS-enabled processors is scaled down while they execute non-critical tasks and during idle and communication phases, without extending the execution time of tasks. Finally, a testbed is developed and different parameters are tested on randomly generated DAGs to evaluate and illustrate the effectiveness of EATSDCD. The algorithm is also compared against duplication- and clustering-based algorithms and DVFS-based algorithms. In terms of energy consumption and makespan, the results show that the proposed algorithm can save up to 8.3% and 20% energy compared with the Power Aware List-based Scheduling (PALS) and Power Aware Task Clustering (PATC) algorithms, respectively. Furthermore, there is a 16% improvement over the Parallel Pipeline Latency Optimization (PaPilo) algorithm with Encur = 1.2Enmin(G). Compared with the Reliability Aware Scheduling with Duplication (RASD) algorithm, the execution time is reduced in heterogeneous environments.
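
The second phase described in the abstract (distributing slack along a non-critical path, then lowering processor frequency) can be illustrated with a minimal Python sketch. This is not the EATSDCD implementation: the abstract does not say how slack is divided or which power model is used, so the sketch assumes slack is shared in proportion to each task's run time, that run time scales as f_max / f, and that dynamic power grows with the cube of frequency. All names (`Task`, `distribute_slack`, `pick_frequency`, `relative_energy`) and the frequency levels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    exec_time: float  # run time at the maximum frequency (hypothetical units)


def distribute_slack(tasks, total_slack):
    """Split a path's slack across its tasks in proportion to run time (assumed rule)."""
    total_exec = sum(t.exec_time for t in tasks)
    return {t.name: total_slack * t.exec_time / total_exec for t in tasks}


def pick_frequency(exec_time, slack, levels, f_max):
    """Return the lowest level whose stretched run time fits in exec_time + slack."""
    budget = exec_time + slack
    # Run time is assumed to scale as f_max / f; small tolerance for float error.
    feasible = [f for f in levels if exec_time * f_max / f <= budget + 1e-9]
    return min(feasible)  # f_max itself is always feasible when slack >= 0


def relative_energy(f, f_max):
    """Dynamic energy relative to running at f_max, assuming power ~ f**3."""
    # power ~ f**3 and time ~ 1/f, so energy ~ f**2
    return (f / f_max) ** 2


levels = [1.0, 1.5, 2.0]  # hypothetical frequency levels in GHz, including f_max
path = [Task("A", 4.0), Task("B", 6.0)]  # tasks on one non-critical path
shares = distribute_slack(path, total_slack=5.0)
chosen = {t.name: pick_frequency(t.exec_time, shares[t.name], levels, 2.0) for t in path}
```

With these made-up numbers, task A receives 2.0 units of slack and task B receives 3.0, and both can run at 1.5 GHz instead of 2.0 GHz while still finishing inside their original time slot plus their slack share. Under the assumed cubic power model, that corresponds to roughly 56% of the dynamic energy used at full frequency.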

EaCO: Resource Sharing Dynamics and Its Impact on Energy Efficiency for DNN Training

Energy-Efficient GPU Clusters Scheduling for Deep Learning

E-LAS: Design and Analysis of Completion-Time Agnostic Scheduling for Distributed Deep Learning Cluster

GreenFlow: A Carbon-Efficient Scheduler for Deep Learning Workloads

Dynamic Resource Allocation for Deep Learning Clusters with Separated Compute and Storage

Benchmarking Resource Usage for Efficient Distributed Deep Learning

Optimizing Resource Allocation for Data-Parallel Jobs Via GCN-Based Prediction

CODA: Improving Resource Utilization by Slimming and Co-locating DNN and CPU Jobs

Liquid: Intelligent Resource Estimation and Network-Efficient Scheduling for Deep Learning Jobs on Distributed GPU Clusters

Energy-Aware Non-Preemptive Task Scheduling with Deadline Constraint in DVFS-Enabled Heterogeneous Clusters

Improving Cluster Utilization Through Adaptive Resource Management for Deep Neural Network and CPU Jobs Colocation

EATSDCD: A green energy-aware scheduling algorithm for parallel task-based application using clustering, duplication and DVFS technique in cloud datacenters

GPU Cluster Scheduling for Network-Sensitive Deep Learning

Energy-aware Task Scheduling with Deadline Constraint in DVFS-enabled Heterogeneous Clusters

GAI: A Centralized Tree-Based Scheduler for Machine Learning Workload in Large Shared Clusters

Integrating asynchronous advantage actor–critic (A3C) and coalitional game theory algorithms for optimizing energy, carbon emissions, and reliability of scientific workflows in cloud data centers

Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing

Scaling Deep Learning on GPU and Knights Landing clusters

Qore-DL: A QoS-aware Joint Optimization Framework for Distributed Deep Learning Training

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

CooCo: A Collaborative Offloading and Resource Configuration Algorithm in Edge Networks