TJOSConf: Automatic and Safe System Environment Operations Platform

Yida Wang, Shuangshuang Jiang, Bin Cui
DOI: https://doi.org/10.1145/3524304.3524308
2022-01-01
Abstract: With the exploding number of servers in large IT corporations, managing server system environments at scale is a major challenge. The system environment needs to be updated regularly to meet the system demands of services or to fix emerging bugs, and it is vital to keep services stable during these updates. However, the heterogeneity of workloads and the diversity of update scripts make it difficult to estimate how a system update will affect servers and the services running on them. Unexpected failures can also arise for various reasons during or after update execution, making the situation even worse. This paper addresses the system management challenge with an operations platform called TJOSConf. During system updates, TJOSConf interacts with services so that expected impacts and unexpected failures are recognized and handled in time, making system updates highly automated and safe. Various system update scripts can be easily incorporated into TJOSConf as plugins, making the platform scalable. Use cases of TJOSConf at Alibaba demonstrate its effectiveness.