Site Reliability Engineering: Application of Item Response Theory to Application Deployment Practices and Controls

Kiran Mahesh ND
DOI: https://doi.org/10.48550/arXiv.2008.06717
2020-08-15
Abstract: Reliability of an application or solution in a production environment is one of the fundamental features on which every SRE team is critically focused. At the same time, achieving extreme reliability comes at a cost, which includes but is not limited to a slower pace of new feature deployments, operations cost, and opportunity cost. One earlier effort to provide an objective metric that strikes a fine balance between acceptable reliability and product velocity is the error budget and its associated policy. Organizations also maintain contemporary deployment guidelines and controls to ascertain the reliability of an application version before it is deployed into customer-facing or production environments. This work proposes a new objective metric, the Application Deployment Score, estimated using a dichotomous Item Response Theory model. The score is used to assess the improvement trend of each application version deployed into a customer-facing environment; to identify the scope for improvement of each application deployment in each area of the deployment guidelines and controls; and to adjust the error budget (i.e., the soft error budget) of an interdependent application in an application mesh by assigning soft collective responsibility. Finally, the work defines a new metric called the deployment index, which helps assess the effectiveness of these contemporary deployment guidelines and controls in upholding the agreed SLOs of the application in customer-facing environments. This study opens a new field of research in developing new underlying latent indexes (i.e., new objective metrics) in the SRE and DevOps space.
Software Engineering, Machine Learning
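To make the idea of a dichotomous Item Response Theory model more concrete, the following is a minimal sketch, not the paper's method. The abstract does not say which dichotomous model or estimation procedure is used, so the two-parameter logistic (2PL) form, the grid-search maximum-likelihood estimate, and all names and parameter values here (`p_pass`, `estimate_score`, the `items` list) are assumptions for illustration. Each deployment control is treated as an item that a deployment either passes (1) or fails (0), and the latent trait `theta` stands in for a deployment score.

```python
import math

def p_pass(theta, a, b):
    """2PL item response function (assumed model): probability that a
    deployment with latent score theta passes a control with
    discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a pass/fail pattern given theta.
    responses: list of 0/1 outcomes, one per control.
    items: list of (a, b) pairs, one per control."""
    ll = 0.0
    for x, (a, b) in zip(responses, items):
        p = p_pass(theta, a, b)
        ll += math.log(p) if x == 1 else math.log(1.0 - p)
    return ll

def estimate_score(responses, items, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum-likelihood estimate of theta on [lo, hi]."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Hypothetical controls: (discrimination, difficulty) values are made up.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]
score_v1 = estimate_score([1, 0, 0], items)  # earlier version passes one control
score_v2 = estimate_score([1, 1, 0], items)  # later version passes two controls
```

Under these assumptions, a version that passes more (or more discriminating) controls receives a higher estimated score, which is the kind of version-over-version trend the abstract describes. In practice the item parameters would themselves be estimated from many deployments rather than fixed by hand.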