Anomaly Detection on Unstable Logs with GPT Models

Fatemeh Hadadi, Qinghua Xu, Domenico Bianculli, Lionel Briand
2024-06-12
Abstract: Log-based anomaly detection has been widely studied in the literature as a way to increase the dependability of software-intensive systems. In reality, logs can be unstable due to changes made to the software during its evolution. This, in turn, degrades the performance of downstream log analysis activities, such as anomaly detection. The critical challenge in detecting anomalies on these unstable logs is the lack of information about the new logs, due to insufficient log data from new software versions. The application of Large Language Models (LLMs) to many software engineering tasks has revolutionized various domains. In this paper, we report on an experimental comparison of a fine-tuned LLM and alternative models for anomaly detection on unstable logs. The main motivation is that the pre-training of LLMs on vast datasets may enable a robust understanding of diverse patterns and contextual information, which can be leveraged to mitigate the data insufficiency issue in the context of software evolution. Our experimental results on the two-version dataset of LOGEVOL-Hadoop show that the fine-tuned LLM (GPT-3) fares slightly better than supervised baselines when evaluated on unstable logs. The difference between GPT-3 and other supervised approaches tends to become more significant as the degree of changes in log sequences increases. However, it is unclear whether the difference is practically significant in all cases. Lastly, our comparison of prompt engineering (with GPT-4) and fine-tuning reveals that the latter provides significantly superior performance on both stable and unstable logs, offering valuable insights into the effective utilization of LLMs in this domain.
Software Engineering
What problem does this paper attempt to address?
The main problem this paper addresses is that, as software systems evolve, changes to the structure and content of logs make them unstable, which degrades the performance of existing log-based anomaly detection methods. Specifically, a software update can change logs at two levels:

- **Log Template Level**: Log templates may be added, deleted, or modified.
- **Log Sequence Level**: The order of log messages may change, and some log templates may be added or removed.

As a result, logs generated by the new software version differ from those of the old version, so an anomaly detection model trained on old-version logs performs poorly on new-version logs. The paper therefore explores how large language models (LLMs) can mitigate this data insufficiency problem in the context of software evolution.

### Research Background

1. **Log Instability Problem**:
   - Software systems evolve constantly, and their logs change accordingly.
   - These changes make logs unstable and degrade the performance of downstream log analysis tasks such as anomaly detection.
   - The main challenge is the scarcity of log data from the new version, which makes it difficult to retrain or adapt existing models.
2. **Application of Large Language Models**:
   - LLMs learn a wide range of text patterns and contextual information during pre-training.
   - This knowledge may help compensate for the data insufficiency caused by software evolution and improve the robustness of anomaly detection.

### Solutions

The paper examines two strategies for using LLMs to deal with log instability:

1. **Fine-tuning**:
   - Fine-tune the LLM on task-specific log data so that it adapts to the target task, namely anomaly detection on unstable logs (ADUL).
   - In this way, the model can better capture the patterns and contextual information in the logs.
2. **Prompt Engineering**:
   - Construct effective prompts and pass them to the pre-trained LLM to perform anomaly detection.
   - Prompts usually include a task description, the expected inputs and outputs, and examples of related tasks.

### Experimental Results

From experiments on two public datasets (LOGEVOL-Hadoop and HDFS) and a synthetic dataset (SynHDFS), the authors draw the following conclusions:

1. **Fine-tuned GPT-3 performs slightly better than the supervised baselines on unstable logs**.
2. **As the degree of log change increases, the gap between fine-tuned GPT-3 and the other supervised methods widens**, although it is unclear whether the difference is practically significant in all cases.
3. **Fine-tuned GPT-3 outperforms prompt-engineered GPT-4 on both stable and unstable logs**, which offers guidance on how to use LLMs effectively for ADUL.

### Summary

The main contribution of this paper is an exploration of LLMs for anomaly detection on unstable logs. By comparing fine-tuning with prompt engineering, it shows that fine-tuning is the stronger strategy for this task, which provides a useful reference for future research and practical applications.
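
To make the prompt engineering strategy more concrete, the following is a minimal Python sketch of how a prompt with the three ingredients named above (task description, expected input/output, and labeled examples) could be assembled for a log sequence. It is an illustration under stated assumptions, not the prompt used in the paper: the wording of the task description, the `LabeledSequence` type, the function names, and the sample log messages are all hypothetical, and the call to the LLM itself is deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledSequence:
    """A log sequence with a known label (hypothetical helper type)."""
    messages: List[str]
    is_anomaly: bool


# Assumed wording; the paper's actual prompt text is not reproduced here.
TASK_DESCRIPTION = (
    "You are given a sequence of log messages from a software system. "
    "Decide whether the sequence is normal or anomalous."
)
OUTPUT_FORMAT = "Answer with exactly one word: 'normal' or 'anomalous'."


def format_sequence(messages: List[str]) -> str:
    """Render a log sequence as a numbered list, one message per line."""
    return "\n".join(f"{i}. {m}" for i, m in enumerate(messages, start=1))


def build_prompt(examples: List[LabeledSequence], query: List[str]) -> str:
    """Combine task description, output format, examples, and the query."""
    parts = [TASK_DESCRIPTION, OUTPUT_FORMAT, ""]
    for n, ex in enumerate(examples, start=1):
        label = "anomalous" if ex.is_anomaly else "normal"
        parts += [f"Example {n}:", format_sequence(ex.messages),
                  f"Answer: {label}", ""]
    parts += ["Sequence to classify:", format_sequence(query), "Answer:"]
    return "\n".join(parts)


def parse_answer(reply: str) -> bool:
    """Map a one-word model reply to a label; True means anomalous."""
    word = reply.strip().lower().rstrip(".")
    if word not in {"normal", "anomalous"}:
        raise ValueError(f"unexpected reply: {reply!r}")
    return word == "anomalous"


if __name__ == "__main__":
    # Invented log messages, for illustration only.
    shots = [
        LabeledSequence(["Task started", "Task finished"], False),
        LabeledSequence(["Task started", "Connection lost"], True),
    ]
    print(build_prompt(shots, ["Task started", "Retrying connection"]))
```

The resulting string would be sent to the pre-trained model as is, and `parse_answer` turns the one-word reply back into a binary label. Restricting the output to a single word keeps parsing simple, at the cost of discarding any explanation the model might give.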