Real-Time Diagnosis of Configuration Errors for Software of AI Server Infrastructure

Guangquan Xu, Xinru Ding, Sihan Xu, Yan Jia, Shaoying Liu, Shicheng Feng, Xi Zheng
DOI: https://doi.org/10.1109/tdsc.2023.3266007
2023-01-01
IEEE Transactions on Dependable and Secure Computing
Abstract: Artificial intelligence (AI) server infrastructure has been built to support AI applications and handle data-intensive workloads. This infrastructure is an essential building block, and errors in it may lead to fatal consequences for any AI application built upon it. Compared to traditional software, software for AI server infrastructure is more configurable, and thus more likely to have configuration errors that prevent correct software behavior. Previous work on misconfiguration diagnosis requires sufficient execution history or manual intervention, and can hardly diagnose potential misconfigurations that are not triggered at launch. In this paper, we propose a real-time method to address these issues. Specifically, we combine program analysis and real-time log parsing to diagnose configuration errors. Our method maps each configuration option to the logging code by applying program slicing only once, and parses real-time logs during the operation of the AI server without manual intervention. We evaluate the effectiveness of our approach on the core components of Hadoop, an exemplar of AI server infrastructure software. The results show that our method mapped more than 80% of the configuration options to log outputs, identified 90% of the configuration read sites as slicing seeds, and successfully diagnosed about 10% of configuration errors that cannot be addressed by previous studies.
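The runtime half of the approach described in the abstract (matching incoming log lines against a mapping from log statements to configuration options) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the one-time slicing step has already produced a table pairing log templates with configuration options. The log message templates below are invented for illustration, and their pairing with the Hadoop option names is hypothetical.

```python
import re
from typing import Iterable, Iterator, List, Pattern, Tuple

# Hypothetical output of the one-time slicing step: each log template
# (written as a regex) is paired with the configuration options whose
# values can reach that log statement. The message texts are invented
# for this sketch and are not taken from the Hadoop code base.
TEMPLATE_TO_OPTIONS: List[Tuple[Pattern[str], List[str]]] = [
    (
        re.compile(r"Replication factor (\d+) exceeds maximum (\d+)"),
        ["dfs.replication", "dfs.replication.max"],
    ),
    (
        re.compile(r"Cannot bind to address (\S+)"),
        ["dfs.namenode.rpc-address"],
    ),
]


def diagnose_line(line: str) -> List[str]:
    """Return the options linked to the first template that matches the line."""
    for pattern, options in TEMPLATE_TO_OPTIONS:
        if pattern.search(line):
            return options
    return []


def diagnose_stream(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (log line, suspect options) for every line matching a template."""
    for line in lines:
        options = diagnose_line(line)
        if options:
            yield line.rstrip("\n"), options
```

In this sketch, `diagnose_stream` can be fed any iterable of log lines, such as an open log file, and reports suspect options as lines arrive, so no execution history is needed beyond the precomputed table.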