Unlocking the Power of Numbers: Log Compression via Numeric Token Parsing

Siyu Yu, Yifan Wu, Ying Li, Pinjia He
2024-08-11
Abstract: Parser-based log compressors have been widely explored in recent years because the explosive growth of log volumes makes the compression performance of general-purpose compressors unsatisfactory. These parser-based compressors preprocess logs by grouping the logs based on the parsing result and then feed the preprocessed files into a general-purpose compressor. However, parser-based compressors have their limitations. First, the goals of parsing and compression are misaligned, so the inherent characteristics of logs were not fully utilized. In addition, the performance of parser-based compressors depends on the sample logs and thus it is very unstable. Moreover, parser-based compressors often incur a long processing time. To address these limitations, we propose Denum, a simple, general log compressor with high compression ratio and speed. The core insight is that a majority of the tokens in logs are numeric tokens (i.e., pure numbers, tokens with only numbers and special characters, and numeric variables) and effective compression of them is critical for log compression. Specifically, Denum contains a Numeric Token Parsing module, which extracts all numeric tokens and applies tailored processing methods (e.g., store the differences of incremental numbers like timestamps), and a String Processing module, which processes the remaining log content without numbers. The processed files of the two modules are then fed as input to a general-purpose compressor and it outputs the final compression results. Denum has been evaluated on 16 log datasets and it achieves an 8.7%-434.7% higher average compression ratio and 2.6x-37.7x faster average compression speed (i.e., 26.2 MB/s) compared to the baselines. Moreover, integrating Denum's Numeric Token Parsing into existing log compressors can provide an 11.8% improvement in their average compression ratio and achieve 37% faster average compression speed.
Software Engineering
What problem does this paper attempt to address?
This paper addresses the inefficiencies of existing log compression methods when dealing with large-scale log data. It focuses on the following problems:

1. **Misalignment between parsing and compression goals**: Existing parser-based log compressors do not take the requirements of compression into account when parsing logs, so they fail to fully exploit the intrinsic characteristics of logs. For example, numeric tokens are similar across multiple templates, but template-based parsing ignores these similarities.
2. **Unstable performance**: The performance of parser-based log compressors depends on the sample logs, so it is very unstable. For example, the compression ratio of LogShrink on the HealthApp dataset can vary from 13 to 65, and a compression ratio of 13 is even worse than that of general-purpose compressors.
3. **Excessive processing time**: Parser-based log compressors usually require a long processing time, which is a significant bottleneck for large-scale log data.
4. **Under-utilization of numeric tokens**: Logs contain a large number of numeric tokens (such as timestamps, counters, and thread/process IDs), but existing compression methods fail to exploit the characteristics of these tokens. Research shows that numeric tokens account for up to 70% of pre-processed log files.

To solve these problems, the authors propose a novel log compression tool named Denum. The core idea of Denum is to improve compression efficiency by processing numeric tokens specifically. Denum contains two main modules:

- **Numeric Token Parsing module**: This module extracts all numeric tokens from the original log and assigns them different labels according to their patterns. Tailored processing methods are then applied to the numeric tokens under each label, such as storing the differences of incremental timestamps.
- **String Processing module**: This module processes the log content that remains after the numeric values are removed and compresses it using a dictionary-index storage method.

With this design, Denum can significantly improve compression speed while maintaining a high compression ratio. Experimental results show that on 16 benchmark datasets, the average compression ratio of Denum is 8.7%-434.7% higher than that of the baselines, and its average compression speed is 2.6x-37.7x faster. In addition, Denum's Numeric Token Parsing module can be integrated into existing log compressors to further improve their compression ratio and speed.
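To make the two modules more concrete, here is a minimal Python sketch of the general idea. It is not Denum's implementation, and every name in it is hypothetical: `split_line`, `delta_encode`, `delta_decode`, `dictionary_index`, and `rebuild` are illustrative helpers, and the `<*>` placeholder and the sample log lines are invented for the example. It makes three simplifying assumptions: a numeric token is just a maximal run of digits (the paper's definition is broader), the first numeric token on each line is an incremental timestamp, and no log line contains the placeholder string itself.

```python
import re

# Assumption: a "numeric token" is a maximal run of ASCII digits. The paper's
# definition also covers tokens mixing numbers with special characters.
NUM = re.compile(r"[0-9]+")
PLACEHOLDER = "<*>"  # hypothetical marker, assumed absent from the logs


def split_line(line):
    """Split a line into its string skeleton and its numeric tokens.

    Tokens are kept as strings so that leading zeros survive a round trip.
    """
    return NUM.sub(PLACEHOLDER, line), NUM.findall(line)


def rebuild(skeleton, tokens):
    """Inverse of split_line: put the numeric tokens back in order."""
    parts = skeleton.split(PLACEHOLDER)
    out = parts[0]
    for token, part in zip(tokens, parts[1:]):
        out += token + part
    return out


def delta_encode(values):
    """Keep the first value, then store each successive difference."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Inverse of delta_encode: a running sum restores the values."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def dictionary_index(items):
    """Store each distinct string once and refer to it by index."""
    table = {}
    indices = [table.setdefault(item, len(table)) for item in items]
    return list(table), indices


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    logs = [
        "1691712000 worker 7 took 15 ms",
        "1691712001 worker 7 took 12 ms",
        "1691712003 cache miss for block 88",
        "1691712004 worker 9 took 15 ms",
    ]
    skeletons, tokens = zip(*(split_line(line) for line in logs))
    # Assumed layout: the first numeric token of every line is the timestamp.
    timestamps = delta_encode([int(t[0]) for t in tokens])
    dictionary, indices = dictionary_index(skeletons)
    print(timestamps)
    print(dictionary, indices)
```

In this sketch the timestamp column shrinks to one large value followed by small differences, and the repeated skeletons collapse to a short dictionary plus a list of small indices. Both outputs are more repetitive than the raw text, which is the kind of input a general-purpose compressor handles well.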