ACAS: an Anomaly-Based Cause Aware Auto-Scaling Framework for Clouds

Sara Kardani Moghaddam, Rajkumar Buyya, Kotagiri Ramamohanarao
DOI: https://doi.org/10.1016/j.jpdc.2018.12.002
IF: 4.542
2019-01-01
Journal of Parallel and Distributed Computing
Abstract: Cloud computing, as a model for delivering distributed resources and services under a pay-as-you-go policy, has become increasingly popular among all kinds of organizations, including industry. However, the inherent dynamicity of this environment makes it prone to various types of performance problems, which introduce many challenges for distributed resource management. Advances in big data learning approaches create an opportunity for data-aware, dynamic management of cloud resources. Data collected from the system's performance indicators can be a valuable source of information for identifying unusual behavior in resource consumption or application performance. Different types of problems can cause performance degradation at the VM or system level, and system administrators are overwhelmed by the huge amount of data that must be analyzed to find these problems and assess the overall health of the system. In this paper, we argue that dynamic resource scaling policies can be selected more effectively, and performance improved, by predicting anomalies in the system and narrowing down the possible cause of each anomaly to one of the attributes of the system. We therefore propose a 2-level cause-aware auto-scaling framework that leverages two types of resource management solutions, horizontal and vertical, as corrective actions when performance degrades. We show the effectiveness of vertical scaling as a quick solution when a VM is exposed to some type of local anomaly, while horizontal scaling can be used for system-wide anomalies by adding new VMs to the system. Moreover, our data analysis module can predict anomalies early enough to give the scaling system sufficient time to make an effective scaling decision.
The proposed unsupervised anomaly detection module leverages a new strategy for updating its models that takes changes in the state of the system into account, reducing the overhead of recurrent model training. We compare the proposed framework with an approach used by several popular cloud providers to show the advantage of combining multi-level auto-scaling with knowledge from anomaly detection analysis in resolving performance problems in the cloud.
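The two-level decision described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Anomaly` record, the `choose_actions` function, and the `system_wide_fraction` threshold are all hypothetical names introduced here, and the rule that an anomaly counts as system-wide once it affects a given fraction of VMs is an assumption made only for illustration. The abstract does not say how local and system-wide anomalies are distinguished.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Anomaly:
    """A predicted anomaly on one VM (hypothetical record type)."""
    vm_id: str
    attribute: str  # suspected cause, e.g. "cpu" or "memory"


def choose_actions(
    anomalies: List[Anomaly],
    total_vms: int,
    system_wide_fraction: float = 0.5,  # assumed threshold, not from the paper
) -> List[Dict[str, str]]:
    """Pick corrective actions: vertical for local anomalies, horizontal otherwise."""
    if not anomalies or total_vms <= 0:
        return []

    affected_vms = {a.vm_id for a in anomalies}

    # Assumption: if enough VMs are affected, treat the anomaly as system-wide
    # and add a VM (horizontal scaling) instead of resizing individual VMs.
    if len(affected_vms) / total_vms >= system_wide_fraction:
        return [{"action": "horizontal", "target": "system", "detail": "add_vm"}]

    # Local anomalies: resize the affected attribute on each affected VM
    # (vertical scaling), emitting one action per (VM, attribute) pair.
    seen = set()
    actions: List[Dict[str, str]] = []
    for a in anomalies:
        key = (a.vm_id, a.attribute)
        if key not in seen:
            seen.add(key)
            actions.append(
                {"action": "vertical", "target": a.vm_id, "detail": a.attribute}
            )
    return actions
```

For example, a single CPU anomaly on one VM out of four would yield one vertical action for that VM, while anomalies on three of the four VMs would yield a single horizontal action under the assumed 0.5 threshold.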