LogEval: A Comprehensive Benchmark Suite for Large Language Models In Log Analysis

Tianyu Cui, Shiyu Ma, Ziang Chen, Tong Xiao, Shimin Tao, Yilun Liu, Shenglin Zhang, Duoming Lin, Changchang Liu, Yuzhe Cai, Weibin Meng, Yongqian Sun, Dan Pei
2024-07-02
Abstract: Log analysis is crucial for ensuring the orderly and stable operation of information systems, particularly in the field of Artificial Intelligence for IT Operations (AIOps). Large Language Models (LLMs) have demonstrated significant potential in natural language processing tasks. In the AIOps domain, they excel in tasks such as anomaly detection, root cause analysis of faults, operations and maintenance script generation, and alert information summarization. However, the performance of current LLMs in log analysis tasks remains inadequately validated. To address this gap, we introduce LogEval, the first comprehensive benchmark suite designed to evaluate the capabilities of LLMs in various log analysis tasks. This benchmark covers tasks such as log parsing, log anomaly detection, log fault diagnosis, and log summarization. LogEval evaluates each task using 4,000 publicly available log data entries and employs 15 different prompts for each task to ensure a thorough and fair assessment. By rigorously evaluating leading LLMs, we demonstrate the impact of various LLM technologies on log analysis performance, focusing on aspects such as self-consistency and few-shot contextual learning. We also discuss findings related to model quantization, Chinese-English question-answering evaluation, and prompt engineering. These findings provide insights into the strengths and weaknesses of LLMs in multilingual environments and the effectiveness of different prompt strategies. Different evaluation methods are employed for different tasks to accurately measure the performance of LLMs in log analysis. The insights gained from LogEval's evaluation reveal the strengths and limitations of LLMs in log analysis tasks, providing valuable guidance for researchers and practitioners.
Computation and Language,Information Retrieval
What problem does this paper attempt to address?
This paper addresses the inadequate evaluation of current large language models (LLMs) on log analysis tasks. Although LLMs have shown significant potential in natural language processing and have been widely applied in various fields, their performance on log analysis tasks within the Artificial Intelligence for IT Operations (AIOps) domain has not been fully validated.

The main contributions of the paper include:

1. **Proposing the LogEval Benchmark Suite**:
   - **Dataset Construction**: Constructed a diverse dataset that uses 4,000 publicly available log entries per task, with 15 different prompts per task in Chinese and English to reduce the impact of prompt specificity on model performance.
   - **Comprehensive Benchmark Development**: Evaluated the performance of 18 mainstream large models on four major log analysis tasks: log parsing, anomaly detection, fault diagnosis, and log summarization. Zero-shot and few-shot evaluation methods were used to ensure consistency and accuracy.
   - **Multi-dimensional Evaluation Metrics**: Designed various evaluation rules, such as F1 score and accuracy, and introduced new metrics based on semantic matching and average inference time to comprehensively assess the performance of LLMs in log analysis tasks.
2. **Detailed Research and Analysis**:
   - Through extensive empirical analysis, revealed the strengths and limitations of LLMs in log analysis tasks, giving researchers and practitioners guidance for understanding and applying these models.
   - Provided research findings on model quantization, Chinese and English Q&A evaluation, and prompt engineering, showing how LLMs perform in multilingual environments.

The ultimate goal is to promote the further development and application of LLMs in the field of log analysis.
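To make the accuracy and F1 metrics mentioned above concrete, here is a minimal sketch of how a log anomaly detection run might be scored, with a simple majority vote standing in for self-consistency. This is not LogEval's actual scoring code, which the summary does not show; the label strings, function names, and sample data are all hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Self-consistency (assumed form): keep the most frequent sampled answer."""
    return Counter(answers).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive="anomaly"):
    """Binary F1 for the positive class (here the hypothetical label 'anomaly')."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing was correctly flagged
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels for four log entries.
gold = ["anomaly", "normal", "normal", "anomaly"]

# Hypothetical model outputs: three sampled answers per log entry.
sampled = [
    ["anomaly", "anomaly", "normal"],
    ["normal", "normal", "normal"],
    ["anomaly", "normal", "anomaly"],
    ["normal", "normal", "anomaly"],
]

preds = [majority_vote(answers) for answers in sampled]
print(preds)
print("accuracy:", accuracy(gold, preds))
print("F1:", f1_score(gold, preds))
```

An odd number of samples per entry is used so that the binary vote cannot tie. In a real evaluation the same scoring would be repeated for each prompt and averaged, which is one plausible way to reduce the prompt-specificity effect described above.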