Detective-Dee: A Non-Intrusive in Situ Anomaly Detection and Fault Localization Framework
Yang Man, Shiyi Li, Wen Xia, Yikai Li, Bochun Yu, Yingchi Long, Yanqi Pan
DOI: https://doi.org/10.1109/srds60354.2023.00032
2023-01-01
Abstract: Maintaining the high availability of online systems requires reliable and fast online anomaly detection and fault localization. However, existing anomaly detection methods either suffer from high training costs and low generalization capabilities or are designed and evaluated using offline data, with limited efficacy in online usage. Furthermore, these methods' fault localization capabilities are often inadequate due to external observability constraints. To address these limitations, this paper proposes a novel non-intrusive in situ anomaly detection and fault localization framework, Detective-Dee. The proposed framework leverages a compressed sensing method for anomaly detection, which exhibits strong generalization capabilities and eliminates extensive training. Detective-Dee further improves its performance by incorporating three optimization techniques: concurrent substitution sampling, Look-Up-Table-based similarity calculation, and substitution window-based threshold selection, which improve parallelism and reduce computational and comparison overheads. Additionally, the framework adopts an innovative non-intrusive fault localization strategy triggered by anomaly detection. This approach utilizes the dynamic instrumentation capabilities of eBPF, combined with the extraction of vulnerable functions and function call chains through source code analysis, to improve the online anomaly detection capability and achieve robust fault localization with low overhead. To validate the effectiveness of Detective-Dee, we developed a prototype system and conducted a comprehensive evaluation. The results demonstrate that, compared to the state-of-the-art anomaly detection method, Detective-Dee exhibits a 4x improvement in anomaly detection speed while maintaining higher online and comparable offline detection ability.
Furthermore, under 33 real-world fault cases across eight popular distributed systems, Detective-Dee successfully detects 31 cases and accurately locates 26 cases with less than 1% overhead, outperforming the state-of-the-art method.