ORCHID: Streaming Threat Detection over Versioned Provenance Graphs

Akul Goyal, Jason Liu, Adam Bates, Gang Wang
2024-08-24
Abstract: While Endpoint Detection and Response (EDR) systems are able to efficiently monitor threats by comparing static rules to the event stream, their inability to incorporate past system context leads to high rates of false alarms. Recent work has demonstrated Provenance-based Intrusion Detection Systems (Prov-IDS) that can examine the causal relationships between abnormal behaviors to improve threat classification. However, employing these Prov-IDS in practical settings remains difficult: state-of-the-art neural-network-based systems are only fast in a fully offline deployment model that increases attacker dwell time, while simultaneously using simplified and less accurate provenance graphs to reduce memory consumption. Thus, today's Prov-IDS cannot operate effectively in the real-time streaming setting required for commercial EDR viability. This work presents the design and implementation of ORCHID, a novel Prov-IDS that performs fine-grained detection of process-level threats over a real-time event stream. ORCHID takes advantage of the unique immutable properties of versioned provenance graphs to iteratively embed the entire graph in a sequential RNN model while only consuming a fraction of the computation and memory costs. We evaluate ORCHID on four public datasets, including DARPA TC, to show that ORCHID can provide competitive classification performance while eliminating detection lag and reducing memory consumption by two orders of magnitude.
Cryptography and Security
What problem does this paper attempt to address?
The paper addresses the limitations that keep existing provenance-based intrusion detection systems (Prov-IDS) from being practical. Specifically:

1. **High false positive rate**: Traditional Endpoint Detection and Response (EDR) systems efficiently monitor threats by comparing event streams with static rules, but they cannot incorporate past system context, which results in a relatively high false positive rate.
2. **Insufficient real-time processing ability**: Current Prov-IDS can analyze causal relationships to improve threat classification, but they face challenges in actual deployment. State-of-the-art neural-network-based Prov-IDS run quickly only in a fully offline mode, which increases attacker dwell time, and they rely on simplified, less accurate provenance graphs to reduce memory consumption.
3. **High memory and computing resource consumption**: Existing Prov-IDS require a large amount of memory and computing resources to store and analyze complete provenance graphs. For example, analyzing a versioned provenance graph requires 143.7 GB of memory.

To solve these problems, the paper proposes ORCHID (Online Root Cause Host Intrusion Detection System), a new type of Prov-IDS that aims to achieve real-time, fine-grained, process-level threat detection. The main innovations of ORCHID include:

- **Real-time embedding and classification**: ORCHID takes advantage of the immutable characteristics of versioned provenance graphs and iteratively embeds the entire graph through a Recurrent Neural Network (RNN) model while consuming only a small fraction of the computing and memory costs.
- **Low memory footprint**: Compared with existing methods, ORCHID maintains only the latest version of each system entity, which significantly reduces the memory footprint.
- **Long-dependency capture**: By introducing a "root node" embedding, ORCHID captures long-term dependency relationships and improves its ability to recognize attack behaviors.

The paper shows, through an evaluation on four public datasets (including DARPA TC), that ORCHID eliminates detection lag and reduces memory consumption by two orders of magnitude while maintaining competitive classification performance.

### Formula presentation

ORCHID uses the following formula to update the embeddings of system entities:

\[ D[v_j] = f(D[v_j], D[v_i]) \]

where \(D\) is an internal dictionary that maps each vertex to its latest embedding, and \(f\) is an RNN model.

To capture long-term dependency relationships, ORCHID modifies the RNN update function and introduces a "root node" embedding:

\[ h_i = w \cdot h_{i-1} + b \cdot x_i + c \cdot \left[\frac{1}{n}\sum_{k=1}^{n} r_k\right] \]

where \(\{r_k\}_{k=1}^{n}\) is the set of \(n\) root nodes associated with element \(i\) in the sequence, and \(c\) is a learnable model weight used to balance the information introduced by the root embedding.

Through these innovations, ORCHID achieves efficient real-time threat detection and significantly reduces resource consumption.
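The two update rules under "Formula presentation" can be illustrated with a small sketch. This is not the authors' implementation; it is a minimal illustration under several assumptions: each event is treated as a directed edge from \(v_i\) to \(v_j\), the previous hidden state \(h_{i-1}\) is taken to be \(D[v_j]\) and the input \(x_i\) to be \(D[v_i]\), the weights \(w\), \(b\), and \(c\) are fixed scalars standing in for learned parameters, and no nonlinearity is applied. The names `mean_vector`, `rnn_update`, `stream_embeddings`, and `roots_of` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mean_vector(vectors: Sequence[Vector], dim: int) -> Vector:
    """Average a set of root embeddings; an empty set contributes zeros."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def rnn_update(h_prev: Vector, x: Vector, roots: Sequence[Vector],
               w: float = 0.5, b: float = 0.5, c: float = 0.25) -> Vector:
    """h_i = w*h_{i-1} + b*x_i + c*mean(r_1..r_n).

    Assumption: w, b, c are scalars here; a trained model would learn them.
    """
    r_bar = mean_vector(roots, len(h_prev))
    return [w * h + b * xi + c * r for h, xi, r in zip(h_prev, x, r_bar)]


def stream_embeddings(events: Sequence[Tuple[str, str]],
                      initial: Dict[str, Vector],
                      root_embeddings: Dict[str, Vector],
                      roots_of: Dict[str, List[str]]) -> Dict[str, Vector]:
    """Apply D[v_j] = f(D[v_j], D[v_i]) for each (v_i, v_j) event in order."""
    # D holds exactly one (latest) embedding per entity.
    D = {v: list(e) for v, e in initial.items()}
    for src, dst in events:
        # Hypothetical lookup of the root nodes associated with this element.
        roots = [root_embeddings[r] for r in roots_of.get(dst, [])]
        D[dst] = rnn_update(D[dst], D[src], roots)
    return D


# Toy stream: one process writes to one file; "init" is the file's only root.
initial = {"bash": [1.0, 0.0], "notes.txt": [0.0, 1.0]}
root_embeddings = {"init": [2.0, 2.0]}
roots_of = {"notes.txt": ["init"]}
events = [("bash", "notes.txt")]

D = stream_embeddings(events, initial, root_embeddings, roots_of)
print(D["notes.txt"])
```

Because the dictionary overwrites each entity's entry in place, its size in this sketch depends on the number of entities and not on the number of events processed, which matches the summary's description of keeping only the latest version of each system entity.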