Abstract: Log analysis is crucial for ensuring the orderly and stable operation of information systems, particularly in the field of Artificial Intelligence for IT Operations (AIOps). Large Language Models (LLMs) have demonstrated significant potential in natural language processing tasks. In the AIOps domain, they excel in tasks such as anomaly detection, root cause analysis of faults, operations and maintenance script generation, and alert information summarization. However, the performance of current LLMs in log analysis tasks remains inadequately validated. To address this gap, we introduce LogEval, a comprehensive benchmark suite designed, for the first time, to evaluate the capabilities of LLMs in various log analysis tasks. This benchmark covers log parsing, log anomaly detection, log fault diagnosis, and log summarization. LogEval evaluates each task using 4,000 publicly available log data entries and employs 15 different prompts for each task to ensure a thorough and fair assessment. By rigorously evaluating leading LLMs, we demonstrate the impact of various LLM technologies on log analysis performance, focusing on aspects such as self-consistency and few-shot contextual learning. We also discuss findings related to model quantization, Chinese-English question-answering evaluation, and prompt engineering. These findings provide insights into the strengths and weaknesses of LLMs in multilingual environments and the effectiveness of different prompt strategies. Different evaluation methods are employed for different tasks to accurately measure the performance of LLMs in log analysis. The insights gained from LogEval's evaluation reveal the strengths and limitations of LLMs in log analysis tasks, providing valuable guidance for researchers and practitioners.

### What problem does this paper attempt to solve?

This paper addresses the inadequate performance evaluation of current large language models (LLMs) in log analysis tasks. Although LLMs have shown significant potential in natural language processing tasks and have been widely applied in various fields, their performance in log analysis tasks within the Artificial Intelligence for IT Operations (AIOps) domain has not been fully validated.

The main contributions of the paper include:

1. **Proposing the LogEval Benchmark Suite**:
   - **Dataset Construction**: The authors constructed a diverse dataset containing 4,000 public log entries and used 15 different Chinese and English prompts to reduce the impact of prompt specificity on model performance.
   - **Comprehensive Benchmark Development**: The performance of 18 mainstream large models was evaluated on four major log analysis tasks: log parsing, anomaly detection, fault diagnosis, and log summarization. Zero-shot and few-shot evaluation methods were used to ensure consistency and accuracy.
   - **Multi-dimensional Evaluation Metrics**: The authors designed various evaluation rules, such as F1 score and accuracy, and introduced new metrics based on semantic matching and average inference time to comprehensively assess the performance of LLMs in log analysis tasks.
2. **Detailed Research and Analysis**:
   - Extensive empirical analysis revealed the strengths and limitations of LLMs in log analysis tasks, providing guidance and reference to help researchers and practitioners better understand and apply these models.
   - The paper reports findings on model quantization, Chinese and English Q&A evaluation, and prompt engineering, showing how LLMs perform in multilingual environments.

Through the evaluation and analysis of LogEval, the authors hope to gain a deeper understanding of the strengths and limitations of LLMs in log analysis tasks and to inform practical applications. The ultimate goal is to promote the further development and application of LLMs in the field of log analysis.
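To make the accuracy and F1 metrics mentioned above concrete, the following is a minimal sketch of how a binary task such as log anomaly detection could be scored and then averaged over several prompt variants. It is not LogEval's actual evaluation code: the function names, the `"anomaly"`/`"normal"` labels, and the simple mean across prompts are all assumptions made for illustration.

```python
from statistics import mean


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive="anomaly"):
    """Binary F1 for the `positive` class (label name is an assumption)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate_over_prompts(y_true, predictions_by_prompt):
    """Score each prompt variant separately, then average the scores.

    `predictions_by_prompt` maps a hypothetical prompt name to the list of
    labels a model produced with that prompt.
    """
    per_prompt = {
        name: {"accuracy": accuracy(y_true, preds), "f1": f1_score(y_true, preds)}
        for name, preds in predictions_by_prompt.items()
    }
    return {
        "per_prompt": per_prompt,
        "accuracy": mean(s["accuracy"] for s in per_prompt.values()),
        "f1": mean(s["f1"] for s in per_prompt.values()),
    }


# Toy data, invented for illustration only.
reference = ["anomaly", "normal", "anomaly", "normal"]
model_outputs = {
    "prompt_en_1": ["anomaly", "normal", "anomaly", "normal"],
    "prompt_zh_1": ["anomaly", "anomaly", "normal", "normal"],
}
summary = evaluate_over_prompts(reference, model_outputs)
print(summary["accuracy"], summary["f1"])
```

Averaging over prompt variants is one simple way to reduce the influence of any single prompt's wording on the reported score, which is the motivation the summary gives for using multiple prompts per task.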

LogEval: A Comprehensive Benchmark Suite for Large Language Models In Log Analysis

OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models

Studying and Benchmarking Large Language Models For Log Level Suggestion

Interpretable Online Log Analysis Using Large Language Models with Prompt Strategies

Adapting Large Language Models to Log Analysis with Interpretable Domain Knowledge

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

A Survey on Evaluation of Large Language Models

LexEval: A Comprehensive Chinese Legal Benchmark for Evaluating Large Language Models

LogParser-LLM: Advancing Efficient Log Parsing with Large Language Models

OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety

T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

Through the Lens of Core Competency: Survey on Evaluation of Large Language Models

TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs

PsyEval: A Comprehensive Large Language Model Evaluation Benchmark for Mental Health

S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Model

Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation

E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models

LLM Comparator: Interactive Analysis of Side-by-Side Evaluation of Large Language Models