Abstract: Recent advancements in Large Language Models (LLMs) and related technologies such as Retrieval-Augmented Generation (RAG) and Diagram of Thought (DoT) have enabled the creation of autonomous intelligent systems capable of performing cluster diagnostics and troubleshooting. By integrating these technologies with self-play methodologies, we have developed an LLM-agent system designed to autonomously diagnose and resolve issues within AI clusters. Our innovations include a knowledge base tailored for cluster diagnostics, enhanced LLM algorithms, practical deployment strategies for agents, and a benchmark specifically designed for evaluating LLM capabilities in this domain. Through extensive experimentation across multiple dimensions, we have demonstrated the superiority of our system in addressing the challenges faced in cluster diagnostics, particularly in detecting and rectifying performance issues more efficiently and accurately than traditional methods.

### What problem does this paper attempt to solve?

This paper addresses the diagnosis and repair problems that arise while operating AI clusters, especially efficient and accurate fault detection and repair when performance degrades or other operational anomalies occur. Specifically, it proposes an autonomous intelligent cluster diagnosis system based on large language model (LLM) agents, together with an evaluation framework, to enhance cluster resilience. The main objectives of the research are:

1. **Autonomous Diagnosis and Repair**: By integrating LLMs, Retrieval-Augmented Generation (RAG), and Diagram of Thought (DoT), develop an intelligent system that can autonomously diagnose and resolve problems within AI clusters.
2. **Improve Efficiency and Accuracy**: Compared with traditional methods, the LLM-agent system identifies and repairs problems in less time, significantly reducing troubleshooting time. For example, in simulation experiments, the system discovers and repairs an under-clocked GPU in just a few minutes, while traditional methods may take nearly an hour.
3. **Predictive Maintenance**: The LLM agent can detect potential problems and initiate corrective measures before human operators notice performance degradation, improving the availability and reliability of the entire cluster.
4. **Knowledge Base and Algorithm Optimization**: Establish a knowledge base specifically for cluster diagnosis and improve the LLM algorithms so that the LLM can handle cluster-specific tasks effectively.
5. **Deployment Strategies and Benchmark Testing**: Propose strategies for deploying agents in practice and develop a benchmark specifically for evaluating LLM capabilities in cluster diagnosis.

In summary, this paper aims to make AI cluster management intelligent and automated, thereby significantly improving cluster stability and performance.
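
To make the under-clocked GPU example concrete, the following is a minimal sketch of one check a diagnostic agent might run. It is not the paper's method: the function name `find_underclocked`, the 0.9 tolerance, and the clock readings are all hypothetical, and a real agent would collect readings from a GPU monitoring tool rather than a hard-coded dictionary.

```python
from statistics import median


def find_underclocked(clocks_mhz, tolerance=0.9):
    """Return the IDs of GPUs whose clock is below tolerance * cluster median.

    clocks_mhz maps a GPU identifier to its observed clock in MHz.
    Both the median baseline and the tolerance are illustrative assumptions.
    """
    if not clocks_mhz:
        return []
    baseline = median(clocks_mhz.values())
    return sorted(
        gpu for gpu, mhz in clocks_mhz.items() if mhz < tolerance * baseline
    )


# Hypothetical readings: one GPU runs well below its peers.
readings = {
    "node0/gpu0": 1410,
    "node0/gpu1": 1405,
    "node1/gpu0": 1110,
    "node1/gpu1": 1410,
}
flagged = find_underclocked(readings)
print(flagged)
```

Comparing each GPU against the cluster median rather than a fixed nominal clock keeps the sketch independent of the GPU model, at the cost of missing faults that affect most of the cluster at once.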

Enhancing Cluster Resilience: LLM-agent Based Autonomous Intelligent Cluster Diagnosis System and Evaluation Framework
