OCRCL: Online Contrastive Learning for Root Cause Localization of Business Incidents
Xiaosong Huang, Hongyi Liu, Yifan Wu, Yujin Zhao, Changlong Wu, Songlin Zhang, Ling Jiang, Tong Jia, Ying Li, Zhonghai Wu
DOI: https://doi.org/10.1109/saner60148.2024.00060
2024-01-01
Abstract: Microservices architecture has garnered extensive attention for its stability and scalability. However, in the complex and dynamic landscape of microservices systems, an incident in one service can propagate to others, resulting in significant economic losses and degraded user experiences. Therefore, effective and precise localization of incidents in microservices systems is a critical concern. Previous research has leveraged runtime data (logs, metrics, call traces) and historical incident data to assist in root cause localization. However, because business incidents (those causing severe impacts on business operations) are scarce and many incidents are reported by users, relevant runtime data and sufficient historical data are often unavailable, rendering previous methods impractical. In response to this challenge, we propose an online contrastive learning-based method for root cause localization of business incidents (OCRCL). We fully exploit incident tickets and the static dependency graph of services, integrating both textual semantic information and structural information from the dependency graph to discover root causes. Furthermore, we suggest that online contrastive learning can exhibit excellent performance with limited data and enable real-time model updates, making it better suited for industrial scenarios. Our approach demonstrates significant improvements over baseline methods across three real-world industrial datasets, highlighting its effectiveness in root cause localization.