Abstract: An increasing number of services are provided to individuals and organizations through cloud computing systems in a pay-as-you-use model. This business service paradigm faces several cloud Quality of Service (QoS) challenges, such as reliability, cost, and response time. The most common mechanism for improving cloud service reliability is the primary/backup (PB) fault-tolerant technique. However, this technique inevitably creates multiple replicas, which leads to high service cost. In recognition of these challenges, we first build a resource management architecture for cloud computing systems. Second, we analyze the execution reliability of a cloud service on the physical resources of a VM and use a CUDA (Compute Unified Device Architecture)-enabled parallel two-dimensional long short-term memory neural network to predict the software faults of a cloud VM. Third, we propose an effective primary/backup cloud service cost calculation approach. To satisfy the cloud service response time constraint, we integrate a response time slack factor into this approach. Fourth, we formulate the reliability and cost aware job scheduling problem for cloud services, which aims to minimize the total cloud service cost and rejection rate while improving system reliability. Fifth, we propose a heuristic greedy reliability and cost aware job scheduling (RCJS) algorithm. Finally, we conduct a performance evaluation, and the experimental results demonstrate that the proposed RCJS algorithm significantly outperforms the optimal redundant VM placement (OPVMP) and MIN-MIN algorithms in terms of average service cost and rejection rate. RCJS also achieves a good reliability trade-off compared with the other two algorithms and is suitable for cloud services with high reliability and low-cost requirements.
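
The abstract does not give the details of RCJS, so the following is only a minimal sketch of the general idea it describes: greedy primary/backup placement that minimizes cost under a slack-inflated response time constraint and rejects jobs that cannot be placed. It is not the authors' algorithm. All names (`VM`, `Job`, `schedule`, `slack`, `min_reliability`) are hypothetical, and the sketch assumes that both copies run actively, that VM faults are independent, and that cost is execution time multiplied by a per-VM price.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass
class VM:
    name: str
    speed: float        # work units processed per time unit (assumed model)
    price: float        # cost per time unit of execution (assumed model)
    reliability: float  # probability that a copy finishes without a fault
    ready_at: float = 0.0  # time at which the VM becomes free


@dataclass
class Job:
    name: str
    work: float      # amount of work to execute
    deadline: float  # response time constraint


def schedule(jobs, vms, slack=0.2, min_reliability=0.99):
    """Greedily place a primary and a backup copy of each job.

    Returns a list of (job name, primary VM, backup VM, cost) tuples.
    A rejected job has None for both VM names and a cost of 0.0.
    """
    plan = []
    for job in sorted(jobs, key=lambda j: j.deadline):
        best = None
        for primary, backup in permutations(vms, 2):
            p_time = job.work / primary.speed
            b_time = job.work / backup.speed
            # Inflate each execution time by the slack factor so that the
            # placement leaves headroom before the deadline.
            p_finish = primary.ready_at + p_time * (1 + slack)
            b_finish = backup.ready_at + b_time * (1 + slack)
            if max(p_finish, b_finish) > job.deadline:
                continue
            # With independent faults, the job fails only if both copies fail.
            combined = 1 - (1 - primary.reliability) * (1 - backup.reliability)
            if combined < min_reliability:
                continue
            cost = p_time * primary.price + b_time * backup.price
            if best is None or cost < best[0]:
                best = (cost, primary, backup, p_time, b_time)
        if best is None:
            plan.append((job.name, None, None, 0.0))  # rejected
            continue
        cost, primary, backup, p_time, b_time = best
        # Reserve both VMs for the duration of their copies.
        primary.ready_at += p_time
        backup.ready_at += b_time
        plan.append((job.name, primary.name, backup.name, cost))
    return plan
```

Calling `schedule(jobs, vms)` with a list of `Job` and `VM` objects returns one tuple per job. The rejection rate is the fraction of tuples whose primary VM is `None`, and the average service cost is the mean cost over the accepted jobs. Note that the function updates `ready_at` on the `VM` objects it is given.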

Analysis of Frequently Failing Tasks and Rescheduling Strategy in the Cloud System

Hunting Killer Tasks for Cloud System Through Machine Learning: A Google Cluster Case Study

Time Series Based Killer Task Online Recognition Service: A Google Cluster Case Study

Hunting Killer Tasks for Cloud System through Behavior Pattern Learning

Evaluating Performance Of Rescheduling Strategies In Cloud System

A Novel Job Scheduling Model to Enhance Efficiency and Overall User Fairness of Cloud Computing Environment.

Online Cost-Rejection Rate Scheduling for Resource Requests in Hybrid Clouds

Predicting Scheduling Failures in the Cloud

Task Scheduling in Cloud Computing Based on The Cuckoo Search Algorithm

A Policy of Task Allocation Base on Distributed Cluster Computing Towards Cloud

Task rescheduling optimization to minimize network resource consumption

Processing time analysis of cloud services with retrying fault-tolerance technique

Testing Tasks Management in Testing Cloud Environment

Monitoring-Based Task Scheduling In Large-Scale Saas Cloud

A Task Scheduling Method For Energy-Performance Trade-Off In Clouds

Improving Failure Tolerance in Large-Scale Cloud Computing Systems

Two-level Task Scheduling with Multi-Objectives in Geo-Distributed and Large-Scale SaaS Cloud

Service Cost Effective and Reliability Aware Job Scheduling Algorithm on Cloud Computing Systems

Research of Scheduling Strategy Based on Fault Tolerance in Hadoop Platform

Analyzing the impact of various parameters on job scheduling in the Google cluster dataset

Multi-objective scheduling of many tasks in cloud platforms.