LibreLog: Accurate and Efficient Unsupervised Log Parsing Using Open-Source Large Language Models

Zeyang Ma, Dong Jae Kim, Tse-Hsun Chen
2024-11-02
Abstract: Log parsing is a critical step that transforms unstructured log data into structured formats, facilitating subsequent log-based analysis. Traditional syntax-based log parsers are efficient and effective, but they often lose accuracy when processing logs that deviate from their predefined rules. Recently, large language model (LLM)-based log parsers have shown superior parsing accuracy. However, existing LLM-based parsers face three main challenges: 1) time-consuming and labor-intensive manual labeling for fine-tuning or in-context learning, 2) increased parsing costs due to the vast volume of log data and the limited context size of LLMs, and 3) privacy risks from using commercial models like ChatGPT with sensitive log information. To overcome these limitations, this paper introduces OpenLogParser, an unsupervised log parsing approach that leverages open-source LLMs (i.e., Llama3-8B) to enhance privacy and reduce operational costs while achieving state-of-the-art parsing accuracy. OpenLogParser first groups logs with similar static text but varying dynamic variables using a fixed-depth grouping tree. It then parses logs within these groups using three components: i) similarity scoring-based retrieval augmented generation, which selects diverse logs within each group based on Jaccard similarity, helping the LLM distinguish between static text and dynamic variables; ii) self-reflection, which iteratively queries LLMs to refine log templates and improve parsing accuracy; and iii) log template memory, which stores parsed templates to reduce LLM queries and improve parsing efficiency. Our evaluation on LogHub-2.0 shows that OpenLogParser achieves 25% higher parsing accuracy and processes logs 2.7 times faster compared to state-of-the-art LLM-based parsers. In short, OpenLogParser addresses the privacy and cost concerns of using commercial LLMs while achieving state-of-the-art parsing efficiency and accuracy.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
The paper targets three key challenges that existing LLM-based log parsers face when dealing with large-scale log data:

1. **Costly manual annotation**: Existing log parsers based on large language models (LLMs) require manually labeled logs for fine-tuning or in-context learning, which is time-consuming and labor-intensive.
2. **High parsing cost**: Because of the huge volume of log data and the limited context window of LLMs, the parsing cost (time and money) increases significantly. As the log scale grows, the parsing cost increases linearly, making practical application difficult.
3. **Privacy risks**: Logs usually contain sensitive information, so sending them to a commercial LLM (such as ChatGPT) for parsing may lead to privacy leakage.

To address these problems, the paper proposes an unsupervised log parsing method named OpenLogParser, with the following main features:

- **No manual annotation**: The approach is unsupervised, so it avoids the cost and complexity of manual labeling.
- **Open-source LLMs**: It uses a smaller open-source LLM (Llama3-8B), which keeps log data private and reduces operating costs.
- **Efficient parsing**: A fixed-depth grouping tree, similarity scoring-based retrieval-augmented generation (RAG), self-reflection, and a log template memory together improve parsing efficiency and accuracy.

OpenLogParser works in the following steps:

1. **Log grouping**: Logs that share similar static text but differ in their dynamic variables are grouped together using a fixed-depth grouping tree.
2. **Unsupervised LLM-based parsing**: Within each group, diverse log samples are selected based on Jaccard similarity. These samples help the LLM distinguish static text from dynamic variables and generate a log template.
3. **Self-reflection**: The LLM is queried iteratively to refine the log templates and improve parsing accuracy.
4. **Log template memory**: Parsed templates are stored so that matching logs do not need to be sent to the LLM again, which further improves parsing efficiency.

The evaluation on LogHub-2.0 shows that OpenLogParser outperforms state-of-the-art LLM-based parsers in both accuracy and speed, achieving 25% higher parsing accuracy and processing logs 2.7 times faster. It also addresses the privacy and cost concerns of using commercial LLMs.
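To make the grouping, sample selection, and template memory steps more concrete, here is a minimal Python sketch. It is not the authors' implementation: the function and class names (`group_logs`, `jaccard`, `select_diverse`, `TemplateMemory`), the two-level grouping key (token count, then first token), the greedy selection strategy, and the `<*>` placeholder for dynamic variables are all assumptions chosen for illustration. The LLM query and the self-reflection loop are left out entirely.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def group_logs(logs: List[str]) -> Dict[Tuple[int, str], List[str]]:
    """Group logs by (token count, first token).

    Assumption: a simplified two-level stand-in for a fixed-depth grouping tree.
    """
    groups: Dict[Tuple[int, str], List[str]] = defaultdict(list)
    for log in logs:
        tokens = log.split()
        key = (len(tokens), tokens[0] if tokens else "")
        groups[key].append(log)
    return dict(groups)


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the whitespace-token sets of two logs."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    if not union:
        return 1.0  # two empty logs are treated as identical
    return len(set_a & set_b) / len(union)


def select_diverse(group: List[str], k: int) -> List[str]:
    """Greedily pick up to k logs that are dissimilar to those already chosen.

    Assumption: the selection strategy here is hypothetical; the source only
    states that diverse logs are selected based on Jaccard similarity.
    """
    if not group or k <= 0:
        return []
    selected = [group[0]]
    remaining = list(group[1:])
    while remaining and len(selected) < k:
        # Choose the candidate whose closest already-selected log is farthest away.
        best = min(remaining, key=lambda c: max(jaccard(c, s) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


class TemplateMemory:
    """Stores parsed templates and matches new logs against them."""

    def __init__(self) -> None:
        self._patterns: List[Tuple[str, "re.Pattern[str]"]] = []

    def add(self, template: str) -> None:
        # Each "<*>" placeholder becomes a single whitespace-free token (a simplification).
        parts = [re.escape(part) for part in template.split("<*>")]
        self._patterns.append((template, re.compile(r"(\S+)".join(parts))))

    def match(self, log: str) -> Optional[str]:
        """Return the first stored template that fully matches the log, else None."""
        for template, pattern in self._patterns:
            if pattern.fullmatch(log):
                return template
        return None


if __name__ == "__main__":
    logs = [
        "Connected to 10.0.0.1 port 22",
        "Connected to 10.0.0.2 port 22",
        "Connected to 192.168.1.5 port 8080",
        "Disconnected from 10.0.0.1",
    ]
    memory = TemplateMemory()
    memory.add("Connected to <*> port <*>")  # in practice this would come from the LLM
    for key, group in group_logs(logs).items():
        unmatched = [log for log in group if memory.match(log) is None]
        print(key, "samples to send to the LLM:", select_diverse(unmatched, 2))
```

In this sketch, a log is sent to the LLM only if no stored template matches it, which is the mechanism by which a template memory can reduce the number of queries. Selecting dissimilar samples within a group exposes more variation in the dynamic fields, which is the intuition behind using diversity to separate static text from variables.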