Abstract: Cloud computing has become a popular technology for executing scientific workflows. However, with large numbers of hosts and virtual machines (VMs) deployed, cloud resource failures, such as the permanent failure of hosts (HPF), the transient failure of hosts (HTF), and the transient failure of VMs (VMTF), undermine service reliability. Fault tolerance is therefore essential for time-consuming scientific workflows in the cloud. Existing fault-tolerant (FT) approaches, however, consider only one or two of these failure types and neglect the others, especially HTF. This paper proposes a Real-time and dynamic Fault-tolerant Scheduling (ReadyFS) algorithm for scientific workflow execution in a cloud, which guarantees deadline constraints and improves resource utilization even in the presence of any of these resource failures. Specifically, we first introduce two FT mechanisms, replication with delay execution (RDE) and checkpointing with delay execution (CDE), to cope with HPF and VMTF simultaneously. Additionally, rescheduling (ReSC) is devised to tackle HTF, which affects the resource availability of the entire cloud datacenter. Then, a resource adjustment (RA) strategy, comprising resource scaling-up (RS-Up) and resource scaling-down (RS-Down), is used to adjust resource demands dynamically and improve resource utilization. Finally, the ReadyFS algorithm is presented to schedule real-time scientific workflows by combining all of the above FT mechanisms with the RA strategy. We evaluate performance using real-world scientific workflows and compare ReadyFS with five vertical comparison algorithms and three horizontal comparison algorithms. Simulation results confirm that ReadyFS guarantees the fault tolerance of scientific workflow execution and improves cloud resource utilization.
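
As an illustration only, the pairing of failure types with FT mechanisms stated in the abstract can be sketched in Python as a lookup table. The names `Failure`, `MECHANISMS`, `mechanisms_for`, and `resource_adjustment` are hypothetical, and the utilization thresholds are assumed values chosen for the sketch; none of this is taken from the ReadyFS algorithm itself, whose details the abstract does not give.

```python
from enum import Enum


class Failure(Enum):
    HPF = "permanent failure of hosts"
    HTF = "transient failure of hosts"
    VMTF = "transient failure of VMs"


# Pairing as stated in the abstract: RDE and CDE together cope with HPF and
# VMTF, while ReSC tackles HTF. The abstract does not say which of RDE/CDE
# serves which failure type, so both are listed for each (an assumption).
MECHANISMS = {
    Failure.HPF: ("RDE", "CDE"),
    Failure.VMTF: ("RDE", "CDE"),
    Failure.HTF: ("ReSC",),
}


def mechanisms_for(failure: Failure) -> tuple:
    """Return the FT mechanisms the abstract associates with a failure type."""
    return MECHANISMS[failure]


def resource_adjustment(utilization: float, low: float = 0.3, high: float = 0.8) -> str:
    """Hypothetical threshold rule standing in for the RA strategy.

    The thresholds `low` and `high` are assumptions for illustration; the
    abstract only states that RA consists of RS-Up and RS-Down.
    """
    if utilization > high:
        return "RS-Up"    # demand exceeds capacity: scale resources up
    if utilization < low:
        return "RS-Down"  # resources are idle: scale resources down
    return "hold"
```

For example, `mechanisms_for(Failure.HTF)` yields only the rescheduling mechanism, reflecting the abstract's point that HTF is handled separately from HPF and VMTF.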

Real-time and Dynamic Fault-Tolerant Scheduling for Scientific Workflows in Clouds