Enhancing Cluster Resilience: LLM-agent Based Autonomous Intelligent Cluster Diagnosis System and Evaluation Framework

Honghao Shi, Longkai Cheng, Wenli Wu, Yuhang Wang, Xuan Liu, Shaokai Nie, Weixv Wang, Xuebin Min, Chunlei Men, Yonghua Lin
2024-11-08
Abstract: Recent advancements in Large Language Models (LLMs) and related technologies such as Retrieval-Augmented Generation (RAG) and Diagram of Thought (DoT) have enabled the creation of autonomous intelligent systems capable of performing cluster diagnostics and troubleshooting. By integrating these technologies with self-play methodologies, we have developed an LLM-agent system designed to autonomously diagnose and resolve issues within AI clusters. Our innovations include a knowledge base tailored for cluster diagnostics, enhanced LLM algorithms, practical deployment strategies for agents, and a benchmark specifically designed for evaluating LLM capabilities in this domain. Through extensive experimentation across multiple dimensions, we have demonstrated the superiority of our system in addressing the challenges faced in cluster diagnostics, particularly in detecting and rectifying performance issues more efficiently and accurately than traditional methods.
Artificial Intelligence; Distributed, Parallel, and Cluster Computing
### What problem does this paper attempt to address?

This paper addresses the diagnosis and repair of problems that arise while operating AI clusters, in particular efficient and accurate fault detection and repair under performance degradation and other operational abnormalities. It proposes an autonomous intelligent cluster diagnosis system based on large language model (LLM) agents, together with an evaluation framework, to enhance cluster resilience. The main objectives are:

1. **Autonomous Diagnosis and Repair**: Integrate LLMs with Retrieval-Augmented Generation (RAG) and Diagram of Thought (DoT) to build an intelligent system that can autonomously diagnose and resolve problems within AI clusters.
2. **Improved Efficiency and Accuracy**: Identify and repair problems faster than traditional methods, reducing troubleshooting time. For example, in simulation experiments the system discovers and repairs an under-clocked GPU in just a few minutes, whereas traditional methods may take nearly an hour.
3. **Predictive Maintenance**: Detect potential problems and initiate corrective measures before human operators notice performance degradation, improving the availability and reliability of the whole cluster.
4. **Knowledge Base and Algorithm Optimization**: Build a knowledge base dedicated to cluster diagnosis and adapt the LLM algorithms to the domain so that the LLM can handle cluster-specific tasks effectively.
5. **Deployment Strategies and Benchmarking**: Propose practical strategies for deploying the agents, and develop a benchmark designed to evaluate LLM capabilities in cluster diagnosis and to verify the system's performance.
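To make the under-clocked GPU example in objective 2 concrete, the following is a minimal sketch of the kind of check a diagnostic tool might run. It is not the paper's method: the function name `find_underclocked`, the 10% tolerance, the fleet-median baseline, and the clock readings are all assumptions introduced here for illustration.

```python
from statistics import median


def find_underclocked(clocks_mhz, tolerance=0.10):
    """Return GPU ids whose clock is more than `tolerance` below the fleet median.

    `clocks_mhz` maps a GPU id to its observed clock in MHz. Using the median
    of all readings as the baseline is an assumption, not the paper's rule.
    """
    if not clocks_mhz:
        return []
    baseline = median(clocks_mhz.values())
    threshold = baseline * (1.0 - tolerance)
    return sorted(gpu for gpu, mhz in clocks_mhz.items() if mhz < threshold)


# Hypothetical readings, invented for illustration only.
readings = {
    "node0:gpu0": 1410,
    "node0:gpu1": 1410,
    "node1:gpu0": 1095,  # noticeably slower than its peers
    "node1:gpu1": 1405,
}

suspects = find_underclocked(readings)
print(suspects)
```

A relative baseline is used here so the check does not depend on knowing the nominal clock of a particular GPU model; a real deployment would more likely compare against the vendor's rated clocks and account for power or thermal throttling.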
In summary, the paper aims to make AI cluster management more intelligent and automated, and thereby to improve the stability and performance of the cluster.