Biglog: Unsupervised Large-scale Pre-training for a Unified Log Representation
Yilun Liu, Chang Su, Shimin Tao, Yichen Zhu, Xiaosong Qiao, Yun Li, Xun Chen, Tao Han, Hao Yang, Liang Zhang, Zuomin Ren, Ying Qin, Weinan Tian, Yuming Xie, Weibin Meng
DOI: https://doi.org/10.1109/IWQoS57198.2023.10188759
2023-06-19
Abstract: Automated log analysis has been widely applied in modern data-center networks, performing critical tasks such as log parsing, log anomaly detection, and log-based failure prediction. However, existing approaches rely on hand-crafted features or domain-specific vectors to represent logs, which either demand laborious manual effort or prove ineffective when a system spans multiple domains. Furthermore, general-purpose word embeddings are not optimized for log data and are thus data-inefficient in handling complex log analysis tasks. In this paper, we present a pre-training phase that teaches language models both in-sentence and cross-sentence features of logs, resulting in a unified log representation that is well suited to various downstream analysis tasks. The pre-training phase is unsupervised and uses 0.45 billion logs from 16 diverse domains. Experiments on 12 publicly available evaluation datasets across 3 tasks indicate the superiority of our approach over existing approaches, especially in online scenarios with limited historical logs. Our approach also exhibits remarkable few-shot learning ability and domain adaptiveness: it outperforms existing approaches while using only 0.0025% of the training data they require, and it adapts to new domains from only a few in-domain logs. We release our code and pre-trained model.
Computer Science