Abstract: Information Technology (IT) Operations (Ops), particularly Artificial Intelligence for IT Operations (AIOps), is essential to maintaining the orderly and stable operation of existing information systems. According to Gartner's prediction, the use of AI technology for automated IT operations has become a new trend. Large language models (LLMs), which have exhibited remarkable capabilities in NLP-related tasks, are showing great potential in AIOps, for example in root cause analysis of failures, generation of operations and maintenance scripts, and summarization of alert information. Nevertheless, the performance of current LLMs in Ops tasks is yet to be determined. In this paper, we present OpsEval, a comprehensive task-oriented Ops benchmark designed for LLMs. For the first time, OpsEval assesses LLMs' proficiency in various crucial scenarios at different ability levels. The benchmark includes 7184 multiple-choice questions and 1736 question-answering (QA) items in English and Chinese. By conducting a comprehensive performance evaluation of the current leading large language models, we show how various LLM techniques can affect performance on Ops tasks, and discuss findings related to various topics, including model quantization, QA evaluation, and hallucination issues. To ensure the credibility of our evaluation, we invite dozens of domain experts to manually review our questions. At the same time, we have open-sourced 20% of the test QA to assist current researchers in preliminary evaluations of their OpsLLM models. The remaining 80% of the data is not disclosed, in order to prevent test set leakage. Additionally, we have constructed an online leaderboard that is updated in real time and will continue to be updated, ensuring that any newly emerging LLMs will be evaluated promptly. Both our dataset and leaderboard have been made public.

What problem does this paper attempt to address?

The problem this paper attempts to solve is that existing natural language processing (NLP) benchmarks cannot effectively evaluate the performance of large language models (LLMs) in IT operations (Ops). Specifically, general NLP benchmarks such as C-Eval and MMLU, and commonly used evaluation metrics such as BLEU and ROUGE, do not fully reflect how effective LLMs actually are in operations and maintenance tasks. A comprehensive benchmark suite dedicated to IT operations and maintenance is therefore needed to guide the selection of suitable LLMs and to optimize their performance on these tasks.

To meet this challenge, the authors propose **OpsEval**, a comprehensive benchmark suite for evaluating the capabilities of large language models in IT operations and maintenance. OpsEval includes:

1. **Operations-oriented evaluation dataset**: It contains 7,184 multiple-choice questions and 1,736 short-answer questions, covering multiple sub-fields.
2. **Operations and maintenance evaluation benchmark**: Multiple evaluation scenarios are designed, such as self-consistency, chain-of-thought, and in-context learning.
3. **Specially designed question-answering evaluation method**: FAE-Score is proposed to evaluate model performance along three dimensions: fluency, accuracy, and evidence. Its correlation coefficient with human expert scores reaches 0.9175.

Through these measures, OpsEval aims to provide a comprehensive and authoritative evaluation framework that helps researchers and practitioners better understand the performance of existing LLMs in operations and maintenance tasks and provides guidance for future model optimization.

### Key problem summary

- **Sensitive data**: Operations and maintenance data are usually sensitive and proprietary, and are difficult to obtain publicly.
- **Sub-field diversity**: Operations and maintenance span multiple sub-fields, each requiring different capabilities and terminology.
- **Prompt engineering sensitivity**: Because of the particular nature of the operations and maintenance field, existing LLMs are very sensitive to prompt engineering.
- **Insufficient evaluation metrics**: Existing evaluation metrics such as BLEU and ROUGE cannot accurately reflect actual performance on operations and maintenance tasks.

Through OpsEval, the authors hope to fill this gap and provide a reliable tool for evaluating LLMs in the operations and maintenance field.
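To make the self-consistency scenario mentioned above more concrete, here is a minimal sketch of how it could be scored on multiple-choice questions: sample several answers per question, take the majority vote, and compare it with the gold label. This is an illustration under stated assumptions, not the OpsEval implementation. The names `majority_vote`, `self_consistency_accuracy`, `make_stub_sampler`, and the `sample_answer` callback are hypothetical, and the stub sampler stands in for a real LLM call.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Dict, List, Sequence


def majority_vote(samples: Sequence[str]) -> str:
    """Return the most frequent answer among the sampled outputs.

    Ties are resolved in favor of the answer seen first, which is how
    Counter.most_common orders elements with equal counts.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    return Counter(samples).most_common(1)[0][0]


def self_consistency_accuracy(
    questions: Sequence[str],
    gold: Sequence[str],
    sample_answer: Callable[[str], str],  # hypothetical: one sampled answer per call
    n_samples: int = 5,
) -> float:
    """Accuracy when each question is answered by majority vote over n_samples."""
    if not questions or len(questions) != len(gold):
        raise ValueError("questions and gold must be non-empty and the same length")
    correct = 0
    for question, label in zip(questions, gold):
        votes = [sample_answer(question) for _ in range(n_samples)]
        if majority_vote(votes) == label:
            correct += 1
    return correct / len(questions)


def make_stub_sampler(scripted: Dict[str, List[str]]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: replays scripted answers per question."""
    iterators = {question: cycle(answers) for question, answers in scripted.items()}
    return lambda question: next(iterators[question])


if __name__ == "__main__":
    sampler = make_stub_sampler({
        "q1": ["A", "B", "A"],  # majority "A"
        "q2": ["C", "D", "D"],  # majority "D"
    })
    acc = self_consistency_accuracy(["q1", "q2"], ["A", "C"], sampler, n_samples=3)
    print(f"self-consistency accuracy: {acc:.2f}")
```

In a real evaluation, `sample_answer` would call the model with a non-zero sampling temperature and extract the chosen option letter from its output. The extraction step is left out here because it depends on the prompt format.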

OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models

LogEval: A Comprehensive Benchmark Suite for Large Language Models In Log Analysis

OWL: A Large Language Model for IT Operations

An Empirical Study of NetOps Capability of Pre-Trained Large Language Models

OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety

TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs

A Survey on Evaluation of Large Language Models

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

KoLA: Carefully Benchmarking World Knowledge of Large Language Models

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

Large Language Models in Healthcare: A Comprehensive Benchmark

Through the Lens of Core Competency: Survey on Evaluation of Large Language Models

A Survey on Evaluation of Large Language Models

LexEval: A Comprehensive Chinese Legal Benchmark for Evaluating Large Language Models

Evaluating the Performance of Large Language Models on GAOKAO Benchmark

PsyEval: A Comprehensive Large Language Model Evaluation Benchmark for Mental Health

Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena

Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation

Revolutionizing Database Q&A with Large Language Models: Comprehensive Benchmark and Evaluation