Abstract: Cloud computing (CC) is among the fastest-growing technologies in the computer industry. Its challenges include resource allocation, security, quality of service, availability, privacy, data management, performance compatibility, and fault tolerance. Fault tolerance (FT) refers to a system's ability to continue performing its intended task in the presence of faults. Fault-tolerance challenges include heterogeneity and a lack of standards, the need for automation, cloud downtime reliability, recovery point objectives, recovery time objectives, and cloud workload. The proposed research combines machine learning (ML) algorithms, namely naïve Bayes (NB), library support vector machine (LibSVM), multinomial logistic regression (MLR), sequential minimal optimization (SMO), K-nearest neighbor (KNN), and random forest (RF), with a fault-tolerance method known as delta-checkpointing to achieve higher accuracy, lower fault-prediction error, and greater reliability. The secondary data were collected from the homonymous, experimental high-performance computing (HPC) system at the Swiss Federal Institute of Technology (ETH), Zurich, and the primary data were generated using virtual machines (VMs) to select the best machine learning classifier. In this article, the secondary and primary data were divided using split ratios of 80/20 and 70/30, respectively, and 5-fold cross-validation was used to identify the classifier with the highest accuracy and the lowest fault-prediction error in terms of true, false, repair, and failure of virtual machines. On the secondary data, naïve Bayes performed exceptionally well on the CPU-Mem mono and multi blocks, and sequential minimal optimization performed very well on the HDD mono and multi blocks, in terms of accuracy and fault prediction. On the primary data, random forest performed very well in terms of accuracy and fault prediction but had poor time complexity, whereas sequential minimal optimization had good time complexity with only minor differences from random forest in accuracy and fault prediction. We therefore chose to modify sequential minimal optimization. Finally, the modified sequential minimal optimization (MSMO) algorithm, combined with the delta-checkpointing (D-CP) fault-tolerance method, is proposed to improve accuracy, fault-prediction error, and reliability in cloud computing.
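
The abstract names delta-checkpointing but does not describe how it works. As a rough illustration of the general idea only (not the authors' D-CP method), the sketch below stores one full base checkpoint and afterwards records only the entries that changed since the previous checkpoint. The class name `DeltaCheckpointer`, the flat dictionary state, and the example keys are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional

# Sentinel marking a key that was removed since the previous checkpoint.
_DELETED = object()


class DeltaCheckpointer:
    """Hypothetical delta checkpointer for a flat dict of immutable values."""

    def __init__(self) -> None:
        self._base: Optional[Dict[str, Any]] = None  # first full snapshot
        self._deltas: List[Dict[str, Any]] = []      # changes per checkpoint
        self._last: Dict[str, Any] = {}              # state at last checkpoint

    def checkpoint(self, state: Dict[str, Any]) -> int:
        """Record a checkpoint and return how many entries were stored."""
        if self._base is None:
            # The first checkpoint has nothing to diff against: store it whole.
            self._base = dict(state)
            self._last = dict(state)
            return len(state)
        # Keep only keys that are new or whose value changed.
        delta: Dict[str, Any] = {
            k: v for k, v in state.items()
            if k not in self._last or self._last[k] != v
        }
        # Record keys that disappeared since the previous checkpoint.
        for k in self._last:
            if k not in state:
                delta[k] = _DELETED
        self._deltas.append(delta)
        self._last = dict(state)
        return len(delta)

    def restore(self) -> Dict[str, Any]:
        """Rebuild the latest state by replaying deltas over the base."""
        if self._base is None:
            raise RuntimeError("no checkpoint has been taken")
        state = dict(self._base)
        for delta in self._deltas:
            for k, v in delta.items():
                if v is _DELETED:
                    state.pop(k, None)
                else:
                    state[k] = v
        return state
```

With a state such as `{"cpu": 0.2, "mem": 512, "disk": "ok"}`, a second checkpoint in which only `cpu` changed stores a single entry instead of three. The trade-off in this sketch is that recovery must replay every delta in order, so recovery time grows with the number of checkpoints taken since the base.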
