Assessing the impact of bag‐of‐words versus word‐to‐vector embedding methods and dimension reduction on anomaly detection from log files

Ziyu Qiu, Zhilei Zhou, Bradley Niblett, Andrew Johnston, Jeffrey Schwartzentruber, Nur Zincir‐Heywood, Malcolm I. Heywood
DOI: https://doi.org/10.1002/nem.2251
2023-10-27
International Journal of Network Management
Abstract: In terms of cyber security, log files represent a rich source of information regarding the state of a computer service/system. Automating the process of summarizing log file content represents an important aid for decision‐making, especially given the 24/7 nature of network/service operations. We perform benchmarking over eight distinct log files in order to assess the impact of the following: (1) different embedding methods for developing semantic descriptions of the original log files, (2) applying dimension reduction to the high‐dimensional semantic space, and (3) using different unsupervised learning algorithms for providing a visual summary of the service state. Benchmarking demonstrates that (1) word‐to‐vector embeddings identified by bidirectional encoder representations from transformers (BERT) without “fine‐tuning” are sufficient to match the performance of bag‐of‐words embeddings provided by term frequency‐inverse document frequency (TF‐IDF) and (2) the self‐organizing map without dimension reduction provides the most effective anomaly detector.
computer science, information systems; telecommunications
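
To make the bag‐of‐words side of the comparison in the abstract concrete, the following is a minimal sketch of TF‐IDF embedding of log lines followed by an unsupervised anomaly score. It is not the authors' implementation: the log lines are invented for illustration, the function names `tfidf_vectors` and `anomaly_scores` are hypothetical, the smoothed IDF formula is one common variant, and a simple distance‐from‐centroid score stands in for the self‐organizing map that the paper evaluates.

```python
import math
from collections import Counter


def tfidf_vectors(lines):
    """Return (vocab, vectors): one L2-normalised TF-IDF vector per log line."""
    docs = [line.lower().split() for line in lines]
    vocab = sorted({tok for doc in docs for tok in doc})
    n = len(docs)
    # Document frequency: number of lines in which each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumed, commonly used variant): ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


def anomaly_scores(vectors):
    """Score each vector by its Euclidean distance from the mean vector.

    This is a stand-in for the paper's self-organizing map, chosen only
    to keep the sketch dependency-free.
    """
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return [math.dist(v, centroid) for v in vectors]


# Invented log lines for illustration only (not taken from the paper's data).
logs = [
    "session opened for user alice",
    "session closed for user alice",
    "session opened for user bob",
    "session closed for user bob",
    "kernel panic fatal exception in interrupt handler",
]

vocab, vectors = tfidf_vectors(logs)
scores = anomaly_scores(vectors)
most_anomalous = max(range(len(logs)), key=scores.__getitem__)
print(logs[most_anomalous])
```

In this toy input the last line shares no vocabulary with the routine session messages, so it lies farthest from the centroid. A dimension‐reduction step, as assessed in the paper, would sit between `tfidf_vectors` and the scoring stage.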