Abstract: The escalating complexity of micro-services architecture in cloud-native technologies poses significant challenges for maintaining system stability and efficiency. To conduct root cause analysis (RCA) and resolution of alert events, we propose a pioneering framework, multi-Agent Blockchain-inspired Collaboration for root cause analysis in micro-services architecture (mABC), to revolutionize the AI for IT operations (AIOps) domain. In mABC, multiple agents based on powerful large language models (LLMs) perform blockchain-inspired voting to reach a final agreement, following a standardized process for handling tasks and queries provided by Agent Workflow. Specifically, seven specialized agents derived from Agent Workflow collaborate within a decentralized chain, each providing valuable insights towards root cause analysis based on its expertise and the intrinsic software knowledge of LLMs. To avoid potential instability issues in LLMs and fully leverage the transparent and egalitarian advantages inherent in a decentralized structure, mABC adopts a decision-making process inspired by blockchain governance principles while considering the contribution index and expertise index of each agent. Experimental results on the public benchmark AIOps challenge dataset and our created train-ticket dataset demonstrate superior performance in accurately identifying root causes and formulating effective solutions, compared to previous strong baselines. The ablation study further highlights the significance of each component within mABC, with Agent Workflow, multi-agent, and blockchain-inspired voting being crucial for achieving optimal performance. mABC offers comprehensive automated root cause analysis and resolution in micro-services architecture and achieves a significant improvement in the AIOps domain compared to existing baselines.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the complexity and challenges of root cause analysis (RCA) in micro-services architecture (MSA). Specifically, it focuses on the following issues:

1. **Maintaining system stability and efficiency in micro-services architecture**:
   - As cloud-native technologies develop, micro-services architectures keep growing in complexity, which makes system stability and efficiency harder to maintain.
   - In distributed deployments, faults propagate among service nodes and alert events become increasingly complex, which makes root cause analysis and resolution extremely difficult.
2. **Limitations of existing root cause analysis methods**:
   - Existing methods (such as MicroScope, TraceAnomaly, and MEPFL) handle circular dependencies poorly and depend heavily on supervised data and high coverage of fault types.
   - These methods are mainly call-based or trace-based, and in complex micro-services architectures they struggle with cross-node fault analysis.
3. **Utilizing the potential of large language models (LLMs) and multi-agent systems**:
   - Some tools (such as RCACopilot, RCAgent, and D-Bot) have tried to improve root cause analysis through event matching, information aggregation, and domain knowledge, but they still struggle with cross-node fault analysis in complex micro-services architectures.
   - The rapid development of LLMs in natural language processing opens new possibilities for root cause analysis, but how to integrate these models effectively to improve the accuracy and reliability of root cause analysis remains a challenge.
### Proposed solutions

To this end, the paper proposes a multi-agent blockchain-inspired collaboration framework (mABC) for root cause analysis in micro-services architecture. The main features of mABC include:

- **Multi-agent collaboration**: mABC contains seven specialized agents. The agents collaborate in a decentralized chain, each providing insights based on its expertise and the intrinsic software knowledge of LLMs.
- **Blockchain-inspired voting mechanism**: To avoid the potential instability of LLMs and fully utilize the transparency and fairness of a decentralized structure, mABC adopts a decision-making process inspired by blockchain governance principles that takes each agent's contribution index and expertise index into account.
- **Agent Workflow**: Agent Workflow standardizes task processing and optimizes task assignment and handling by accounting for task difficulty and dynamic context awareness.

Through these innovations, mABC identifies root causes more accurately and develops effective solutions in complex micro-services architectures, significantly improving performance in the AIOps field. Experimental results show that mABC outperforms existing baseline methods on the public benchmark AIOps challenge dataset and the self-created Train-Ticket dataset.
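The weighted voting idea described above can be illustrated with a minimal Python sketch. The summary does not give the actual formula, so everything concrete here is an assumption made for illustration: the `Agent` class, the `tally` function, the rule that an agent's vote weight is the product of its contribution index and expertise index, and the two 0.5 thresholds for support and participation. It is not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """Hypothetical agent record; field names are illustrative."""
    name: str
    contribution: float  # assumed contribution index
    expertise: float     # assumed expertise index

    @property
    def weight(self) -> float:
        # Assumption: vote weight is the product of the two indices.
        return self.contribution * self.expertise


def tally(agents, ballots, support_threshold=0.5, participation_threshold=0.5):
    """Return (passed, support_rate, participation_rate).

    ballots maps an agent name to "for", "against", or "abstain".
    Agents missing from ballots are treated as abstaining.
    Both thresholds are assumed values, not taken from the source.
    """
    total = sum(a.weight for a in agents)
    if total <= 0:
        raise ValueError("total voting weight must be positive")

    # Weight of agents that voted in favour of the proposal.
    support = sum(a.weight for a in agents if ballots.get(a.name) == "for")
    # Weight of agents that cast a non-abstaining vote.
    cast = sum(
        a.weight for a in agents if ballots.get(a.name) in ("for", "against")
    )

    support_rate = support / total
    participation_rate = cast / total
    passed = (
        support_rate >= support_threshold
        and participation_rate >= participation_threshold
    )
    return passed, support_rate, participation_rate


# Hypothetical agents and ballots for a proposed root cause.
agents = [
    Agent("data_detective", contribution=1.0, expertise=2.0),
    Agent("dependency_explorer", contribution=1.0, expertise=1.0),
    Agent("fault_mapper", contribution=1.0, expertise=1.0),
]
ballots = {
    "data_detective": "for",
    "dependency_explorer": "against",
    "fault_mapper": "abstain",
}
print(tally(agents, ballots))
```

Under these assumptions a proposal passes only if enough weighted support and enough weighted participation are both present, so a single unstable LLM answer cannot decide the outcome on its own. A different weighting rule or different thresholds would change the result.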

mABC: multi-Agent Blockchain-Inspired Collaboration for root cause analysis in micro-services architecture
