Abstract: The success of deep learning (DL) techniques has led to their adoption in many fields, including attack investigation, which aims to recover the whole attack story from logged system provenance by analyzing the causality of system objects and subjects. Existing DL-based techniques, e.g., the state-of-the-art ATLAS, follow the design of traditional forensic analysis pipelines. They train a DL model on labeled causal graphs offline to learn benign and malicious patterns. During attack investigation, they first convert the log data to causal graphs and then use the trained DL model to determine whether an entity is part of the attack chain. This design does not fully exploit the power of DL. Existing works such as BERT have demonstrated the superiority of unsupervised pre-trained models, which achieve state-of-the-art results without costly and error-prone data labeling. Prior DL-based attack investigation has overlooked this opportunity. Moreover, generating and operating on the graphs is time-consuming and unnecessary. Based on our study, these operations take around 96% of the total analysis time, resulting in low efficiency. In addition, abstracting individual log entries into graph nodes and edges makes the analysis more coarse-grained, leading to inaccurate and unstable results. We argue that log texts provide the same information as causal graphs but are fine-grained and easier to analyze. This paper presents AIRTAG, a novel attack investigation system powered by unsupervised learning on log texts. Instead of training on labeled graphs, AIRTAG uses unsupervised learning to train a DL model on the log texts, so it does not require the heavyweight and error-prone process of manually labeling logs. During the investigation, the DL model directly takes log files as inputs and predicts entities related to the attack. We evaluated AIRTAG on 19 scenarios, including single-host and multi-host attacks. Our results show the superior efficiency and effectiveness of AIRTAG compared to existing solutions. By removing graph generation and operations, AIRTAG is 2.5x faster than the state-of-the-art method, ATLAS, with 9.0% fewer false positives and 16.5% more true positives on average.
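The abstract gives no implementation details, so the following is only a toy sketch of the general idea of scoring raw log text without labels or graph construction. It is not AIRTAG's model: a smoothed token-frequency score stands in for the DL model, and the function names (`tokenize`, `fit_token_counts`, `rarity_score`) and the sample log lines are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(line):
    # Lowercase a raw log line and split it into alphanumeric tokens.
    return re.findall(r"[a-z0-9_]+", line.lower())


def fit_token_counts(unlabeled_lines):
    # "Training" here is just counting tokens over unlabeled log text.
    # This is a stand-in assumption, not the DL model from the paper.
    counts = Counter()
    for line in unlabeled_lines:
        counts.update(tokenize(line))
    return counts


def rarity_score(line, counts):
    # Average negative log-probability of the line's tokens, with
    # add-one smoothing. Higher means the text looks less like the
    # unlabeled logs the counts were built from.
    tokens = tokenize(line)
    if not tokens:
        return 0.0
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot for unseen tokens
    return sum(
        -math.log((counts[t] + 1) / (total + vocab)) for t in tokens
    ) / len(tokens)


if __name__ == "__main__":
    # Hypothetical log lines, for illustration only.
    history = [
        "firefox read /home/user/a.txt",
        "firefox write /home/user/a.txt",
        "sshd accept 10.0.0.5",
    ]
    counts = fit_token_counts(history)
    for line in ["firefox read /home/user/a.txt",
                 "payload.exe connect 203.0.113.7"]:
        print(f"{rarity_score(line, counts):.3f}  {line}")
```

The point of the sketch is only that the input is the log text itself: no labels and no causal graph are needed before a line can be scored. Lines made mostly of unseen tokens receive a higher score than lines seen in the unlabeled history.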

Unified Semantic Log Parsing and Causal Graph Construction for Attack Attribution

SPARSE: Semantic Tracking and Path Analysis for Attack Investigation in Real-time

LogGenius: an Unsupervised Log Parsing Framework with Zero-shot Prompt Engineering

AIRTAG: Towards Automated Attack Investigation by Unsupervised Learning with Log Texts

Poirot: Causal Correlation Aided Semantic Analysis for Advanced Persistent Threat Detection

Semantic-Aware Log Understanding and Analysis

UniParser: A Unified Log Parser for Heterogeneous Log Data

Lemur: Log Parsing with Entropy Sampling and Chain-of-Thought Merging

A Graph Learning Approach with Audit Records for Advanced Attack Investigation

SemParser: A Semantic Parser for Log Analysis

High-precision Online Log Parsing with Large Language Models

LogTracer: Efficient Anomaly Tracing Combining System Log Detection and Provenance Graph

A Causal Graph-Based Approach for APT Predictive Analytics

AttacKG+: Boosting Attack Graph Construction with Large Language Models

A Framework for Human-Centered Exploration of Complex Event Log Graphs

Towards robust log parsing using self-supervised learning for system security analysis

System Log Parsing: A Survey

Marlin: Knowledge-Driven Analysis of Provenance Graphs for Efficient and Robust Detection of Cyber Attacks

LogParser-LLM: Advancing Efficient Log Parsing with Large Language Models

Parsing into Variable-in-situ Logico-Semantic Graphs

LogPrécis: Unleashing Language Models for Automated Malicious Log Analysis