A Novel Classification Model to Predict Batch Job Failures in Co-located Cloud

Yurui Li, Weiwei Lin, Keqin Li, James Z. Wang, Fagui Liu, Jie Liu
DOI: https://doi.org/10.1109/icpads51040.2020.00080
2020-01-01
Abstract: Nowadays, cloud co-location is often used in data centers to improve the utilization of computing resources. However, batch jobs in a Co-location Datacenter (CLD) are vulnerable to failures because they compete with online service jobs for limited resources. Such failed batch jobs are rescheduled and fail repeatedly, which wastes computing resources and destabilizes the computing clusters. Therefore, we propose a method to accurately predict potential failures of batch jobs in a CLD. The core of the proposed method is STLF (SMOTE Tomek and LightGBM [5] Framework), which is divided into three parts. First, we use the co-feature extraction method to generate the Co-located Feature Dataset (CLFD). Then SMOTE Tomek is used to oversample the CLFD so that the classifier can learn more minority-class features. Finally, we use a LightGBM classifier to predict batch job failures. Performance experiments conducted on the Ali Trace 2018 dataset show that the proposed STLF significantly outperforms existing popular classifiers in terms of the ROC curve, the area under the ROC curve (AUC), precision, and recall.
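To make the resampling step concrete, the following is a minimal, standard-library sketch of the general SMOTE plus Tomek-link idea. It is not the authors' implementation: the function names, the parameter `k`, and the toy data are all hypothetical, and the abstract does not say which library or settings were used. The sketch assumes a binary problem with at least two minority samples, balances the classes by interpolation, and then drops both points of every Tomek link.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        neighbours = sorted((q for q in minority if q is not p),
                            key=lambda q: dist2(p, q))[:k]
        q = rng.choice(neighbours)
        t = rng.random()  # position along the segment from p to q
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return synthetic

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different labels."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: dist2(X[i], X[j]))
    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

def smote_tomek(X, y, minority_label=1, k=3, seed=0):
    """Oversample the minority class until the classes are balanced,
    then remove both samples of every Tomek link (an assumed choice)."""
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    n_new = max(len(X) - 2 * len(minority), 0)
    new = smote(minority, n_new, k, seed)
    X2 = list(X) + new
    y2 = list(y) + [minority_label] * len(new)
    drop = {i for pair in tomek_links(X2, y2) for i in pair}
    keep = [i for i in range(len(X2)) if i not in drop]
    return [X2[i] for i in keep], [y2[i] for i in keep]

# Hypothetical toy data: label 1 stands for a failed batch job (minority).
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
     (0.2, 0.4), (0.4, 0.2), (2.0, 2.0), (2.2, 2.1)]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_res, y_res = smote_tomek(X, y)
print(len(X_res), y_res.count(0), y_res.count(1))
```

The resampled set would then be passed to the classifier. In practice one would more likely combine an existing implementation such as `SMOTETomek` from imbalanced-learn with `LGBMClassifier` from LightGBM than hand-roll the resampling; the quadratic nearest-neighbour search above is only meant to show the mechanics.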