Unsupervised Contextual Anomaly Detection for Database Systems
Sainan Li, Qilei Yin, Guoliang Li, Qi Li, Zhuotao Liu, Jinwei Zhu
DOI: https://doi.org/10.1145/3514221.3517861
2022-01-01
Abstract: Abnormal data access operations occur routinely in database systems, typically as a result of misoperations or attacks, even though these systems enforce strict access control policies. However, prior work focuses only on detecting abnormal data accesses by matching known attack patterns or by identifying behaviors that deviate significantly from normal behavior. Such approaches cannot capture stealthy abnormal data access operations that resemble normal ones. In this paper, we propose UCAD, a novel unsupervised anomaly detection system that detects abnormal data access operations by comparing each operation's semantics with its contextual intent. However, obtaining accurate operation semantics for intent analysis is non-trivial because (i) the same operation may exhibit different semantics under different operation contexts and (ii) different operation sequences may have identical semantics due to heterogeneous user access patterns. To address this issue, we develop a new transformer model for UCAD called Trans-DAS. Trans-DAS learns the semantics of individual operations through an attention mechanism that analyzes the relevance between every pair of operations in a sequence, and it captures the contextual intent of each operation as inferred from its context. Specifically, Trans-DAS uses a dedicated embedding layer that embeds the semantics of individual operations without operation order information, together with a masking mechanism that allows it to learn semantics from bidirectional contexts. We also define a new training objective for Trans-DAS that enlarges the differences among the embedded semantics. Furthermore, to use Trans-DAS effectively for detection, we develop two modules in UCAD: a data preprocessing module that removes noisy data so that Trans-DAS can accurately learn normal semantic information, and an anomaly detection module that learns the semantic information for intent comparison.
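As a rough illustration of comparing an operation's semantics with the intent inferred from its bidirectional context, the toy sketch below scores each operation in a sequence. It is not Trans-DAS: the hand-picked 2-D embeddings, the function name `context_anomaly_scores`, and the "1 minus cosine similarity" scoring rule are all assumptions made for this example, and nothing here is trained.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def context_anomaly_scores(seq, emb):
    """Score each operation as 1 - cosine(operation embedding, context vector).

    Assumption: the context vector is a softmax-weighted average of the
    embeddings of all *other* positions (left and right), with no positional
    encoding. This mimics, very loosely, masked bidirectional attention.
    """
    x = [emb[op] for op in seq]  # order-free embedding lookup
    d = len(x[0])
    scores = []
    for i, xi in enumerate(x):
        # Mask position i: only the surrounding operations form the context.
        others = [xj for j, xj in enumerate(x) if j != i]
        logits = [dot(xi, xj) / math.sqrt(d) for xj in others]
        m = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - m) for l in logits]
        total = sum(weights)
        ctx = [sum(w * xj[k] for w, xj in zip(weights, others)) / total
               for k in range(d)]
        cos = dot(xi, ctx) / (math.sqrt(dot(xi, xi)) * math.sqrt(dot(ctx, ctx)))
        scores.append(1.0 - cos)
    return scores

# Hypothetical embeddings: operations 0-2 are semantically close, 3 is not.
TOY_EMB = {0: [1.0, 0.1], 1: [1.0, 0.0], 2: [1.0, -0.1], 3: [-1.0, 0.0]}
toy_scores = context_anomaly_scores([0, 1, 2, 3, 1, 0], TOY_EMB)
```

In this toy setup the operation whose embedding disagrees with its surrounding operations receives the largest score. Because no positional information is used, permuting the sequence only permutes the scores.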
We evaluate the performance of UCAD on real-world data traces under different settings (e.g., varied parameters and hybrid datasets). The results demonstrate that UCAD achieves an average F1-score of 0.94 in two scenarios, significantly outperforming the baselines, and that it is robust to hybrid data and transfers well to different tasks.
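For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The following minimal sketch computes it for binary anomaly labels; the function name and the example labels are hypothetical and are not taken from the paper's evaluation.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 marks an abnormal operation."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 3 true anomalies, 2 of them detected, 1 false alarm.
example_f1 = f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Here precision and recall are both 2/3, so the F1-score is also 2/3.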