mABC: multi-Agent Blockchain-Inspired Collaboration for root cause analysis in micro-services architecture

Wei Zhang, Hongcheng Guo, Jian Yang, Yi Zhang, Chaoran Yan, Zhoujin Tian, Hangyuan Ji, Zhoujun Li, Tongliang Li, Tieqiao Zheng, Chao Chen, Yi Liang, Xu Shi, Liangfan Zheng, Bo Zhang
2024-05-04
Abstract: The escalating complexity of micro-services architecture in cloud-native technologies poses significant challenges for maintaining system stability and efficiency. To conduct root cause analysis (RCA) and resolution of alert events, we propose a pioneering framework, multi-Agent Blockchain-inspired Collaboration for root cause analysis in micro-services architecture (mABC), to revolutionize the AI for IT operations (AIOps) domain, where multiple agents based on the powerful large language models (LLMs) perform blockchain-inspired voting to reach a final agreement following a standardized process for processing tasks and queries provided by Agent Workflow. Specifically, seven specialized agents derived from Agent Workflow each provide valuable insights towards root cause analysis based on their expertise and the intrinsic software knowledge of LLMs collaborating within a decentralized chain. To avoid potential instability issues in LLMs and fully leverage the transparent and egalitarian advantages inherent in a decentralized structure, mABC adopts a decision-making process inspired by blockchain governance principles while considering the contribution index and expertise index of each agent. Experimental results on the public benchmark AIOps challenge dataset and our created train-ticket dataset demonstrate superior performance in accurately identifying root causes and formulating effective solutions, compared to previous strong baselines. The ablation study further highlights the significance of each component within mABC, with Agent Workflow, multi-agent, and blockchain-inspired voting being crucial for achieving optimal performance. mABC offers a comprehensive automated root cause analysis and resolution in micro-services architecture and achieves a significant improvement in the AIOps domain compared to existing baselines.
Multiagent Systems; Cryptography and Security; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the complexity and challenges of root cause analysis (RCA) in micro-services architecture (MSA). Specifically, it focuses on the following issues:

1. **Maintaining system stability and efficiency in micro-services architecture**:
   - As cloud-native technologies develop, micro-services architectures keep growing more complex, which makes system stability and efficiency harder to maintain.
   - In distributed deployments, faults propagate among service nodes and alert events become increasingly complex, which makes root cause analysis and resolution extremely difficult.
2. **Limitations of existing root cause analysis methods**:
   - Existing methods (such as MicroScope, TraceAnomaly, and MEPFL) do not handle circular dependencies well and depend heavily on supervised data and high coverage of fault types.
   - These methods are mainly call-based or trace-based, and they struggle with cross-node fault analysis in complex micro-services architectures.
3. **Utilizing the potential of large language models (LLMs) and multi-agent systems**:
   - Some tools (such as RCACopilot, RCAgent, and D-Bot) have tried to improve root cause analysis through event matching, information aggregation, and domain knowledge, but they still struggle with cross-node fault analysis in complex micro-services architectures.
   - The rapid development of LLMs in natural language processing opens new possibilities for root cause analysis, but how to integrate these models effectively to improve the accuracy and reliability of RCA remains a challenge.
### Proposed solutions

To this end, the paper proposes mABC, a multi-agent, blockchain-inspired collaboration framework for root cause analysis in micro-services architecture. Its main features are:

- **Multi-agent collaboration**: mABC contains seven specialized agents. Each agent contributes insights according to its expertise and the intrinsic software knowledge of the underlying LLM, collaborating within a decentralized chain.
- **Blockchain-inspired voting mechanism**: To mitigate the potential instability of LLMs and to exploit the transparency and egalitarian nature of a decentralized structure, mABC adopts a decision-making process inspired by blockchain governance principles that takes each agent's contribution index and expertise index into account.
- **Agent Workflow**: A standardized process for handling tasks and queries, which optimizes task assignment and processing through task difficulty and dynamic context awareness.

Through these innovations, mABC identifies root causes more accurately and formulates effective solutions in complex micro-services architectures. Experimental results show that mABC outperforms existing baseline methods on the public AIOps challenge benchmark dataset and on the authors' self-created Train-Ticket dataset.
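As a rough illustration of how a weighted vote among agents could work, the Python sketch below tallies ballots using each agent's contribution index and expertise index. This summary does not give the exact formula, so the details are assumptions made for illustration only: the vote weight is taken as the product of the two indices, a proposal passes when both a support rate and a participation rate reach a threshold, and the names `Agent`, `Vote`, `tally`, and the agent names and threshold values are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Vote(Enum):
    FOR = "for"
    AGAINST = "against"
    ABSTAIN = "abstain"


@dataclass
class Agent:
    name: str
    contribution: float  # contribution index (illustrative value)
    expertise: float     # expertise index (illustrative value)

    @property
    def weight(self) -> float:
        # Assumption: weight is the product of the two indices.
        return self.contribution * self.expertise


def tally(agents, votes, support_threshold=0.5, participation_threshold=0.5):
    """Return (passed, support_rate, participation_rate).

    Agents missing from `votes` are treated as abstaining.
    Both thresholds are hypothetical parameters.
    """
    total = sum(a.weight for a in agents)
    if total <= 0:
        return False, 0.0, 0.0
    # Weight of agents who voted FOR.
    support = sum(a.weight for a in agents if votes.get(a.name) is Vote.FOR)
    # Weight of agents who voted FOR or AGAINST (did not abstain).
    cast = sum(
        a.weight
        for a in agents
        if votes.get(a.name, Vote.ABSTAIN) is not Vote.ABSTAIN
    )
    support_rate = support / total
    participation_rate = cast / total
    passed = (
        support_rate >= support_threshold
        and participation_rate >= participation_threshold
    )
    return passed, support_rate, participation_rate


# Hypothetical agents and ballots for a single proposal.
agents = [
    Agent("agent_a", contribution=1.0, expertise=1.0),
    Agent("agent_b", contribution=1.0, expertise=1.5),
    Agent("agent_c", contribution=1.0, expertise=0.5),
]
votes = {"agent_a": Vote.FOR, "agent_b": Vote.FOR, "agent_c": Vote.AGAINST}
result = tally(agents, votes)
print(result)
```

Weighting by both indices means an agent with more relevant expertise or a stronger track record moves the outcome more than a simple head count would, while the participation threshold stops a proposal from passing on the votes of only a few agents.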