Characterization of operational failures from a business data processing SaaS platform

Catello Di Martino, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Geetika Goel, Santonu Sarkar, Rajeshwari Ganesan
DOI: https://doi.org/10.1145/2591062.2591172
2014-05-31
Abstract: This paper characterizes operational failures of a production Custom Package Good Software-as-a-Service (SaaS) platform. Event logs collected over 283 days of in-field operation are used to characterize platform failures. The characterization estimates (i) the common failure types of the platform, (ii) the key factors impacting platform failures, (iii) the failure rate, and (iv) how user workload (files submitted for processing) affects the failure rate. The major findings are: (i) 34.1% of failures are caused by unexpected values in customers' data, (ii) nearly 33% of failures are due to timeouts, and (iii) the failure rate increases as workload intensity (transactions/second) increases, while there is no statistical evidence that it is influenced by workload volume (size of users' data). Finally, the paper presents the lessons learned and shows how the findings and the implemented analysis tool allow platform developers to improve platform code, system settings, and customer management.