Abstract: The modern Internet is a large-scale distributed system composed of many complex, interoperating sub-networks, which makes network troubleshooting a broad, important, and challenging problem. We study the problems that arise in different diagnosis scenarios, covering modeling, monitoring, and diagnosis. Specifically, in this dissertation we propose solutions to the following four concrete problems.

First, we consider the monitor placement and path selection problem for active monitoring. Targeting ISP VPN networks, our work is unique in taking operational constraints into account. These constraints include the monitors' measurement ability (e.g., throughput) and the link bandwidth allowed for measurement. Given these real-world challenges, we design the VScope monitoring system with the following contributions. First, we design a greedy-assisted linear programming algorithm that selects as few monitors as possible while still monitoring the whole network under the operational constraints. Second, VScope takes a multi-round measurement approach, which gives a smooth tradeoff between measurement frequency and monitor deployment/management cost. We propose three algorithms to schedule the path measurements in different rounds while obeying the operational constraints. Finally, we design a continuous monitoring and diagnosis mechanism that, once faulty paths are discovered, selects the minimal set of extra paths to measure in order to identify the faulty links.

Second, for diagnosis, we propose a Least-biased End-to-end Network Diagnosis (LEND) system for inferring link-level properties such as loss rate. Unlike other statistics-based inference approaches, LEND does not introduce any assumptions beyond those of the linear algebraic model.
We also found a surprising difference between undirected and directed graphs in link-level diagnosis and propose corresponding solutions. We define a minimal identifiable link sequence (MILS) as a link sequence of minimal length whose properties can be uniquely identified from end-to-end measurements. We also design efficient algorithms to find all MILSes and infer their loss rates for diagnosis. Our LEND system works for any network topology and for both directed and undirected properties, and incrementally adapts to network topology and property changes.

Third, it is highly desirable for end users with no special privileges to identify and pinpoint faults inside the network that degrade the performance of their applications. However, existing tools infer link-level loss rates inaccurately and have coarse diagnosis granularity. We propose a suite of simple user-level loss rate diagnosis algorithms that use only one or both ends of a target path. These algorithms probe the routers on the target path and infer the link-level loss rates from the responses. The approaches fall into two categories: (1) deployed only at the source and (2) deployed at both source and destination. For the former, we propose two fragmentation aided diagnosis (FAD) approaches, Algebraic FAD and Opportunistic FAD, which use IP fragmentation to enable accurate link-level loss rate inference. For the latter, we propose Striped Probe Analysis (SPA), which significantly improves the diagnosis granularity over that of the source-only approaches.

Finally, diagnosing fault and performance problems in large distributed systems is an important and challenging problem. Previous research usually traces requests and reconstructs their execution paths, using either inaccurate black-box or intrusive white-box approaches.
In this dissertation, we propose a novel semantics-assisted gray-box diagnosis approach, Rake, which accurately reveals the execution path of each individual request from sniffed network traces. The core idea of Rake is to identify the polymorphic IDs in network messages and link related messages together via application semantics. To make Rake a universal tool for general applications, we design a simple Rake language that lets users supply the necessary semantics and thereby reuse the core Rake linking component. We analyze, test, and evaluate Rake on several popular distributed applications, including a web search system, a distributed computing cluster, content provider networks, DNS, and chat systems. The results show that Rake can be applied widely to distributed applications and is helpful in performance debugging. (Abstract shortened by UMI.)
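
The linear algebraic model that LEND relies on can be illustrated with a small sketch. The abstract does not give the formulation, so the code below assumes a common one: each measured path is a row of a 0/1 routing matrix `G`, each link has an unknown value x = log(1 - loss rate), and the end-to-end measurements satisfy G x = b. Under that assumption, a link sequence with indicator vector `v` is uniquely identifiable exactly when `v` lies in the row space of `G`. The function names, the toy topology, and the loss rates are hypothetical, and the sketch covers neither the minimality search for MILSes nor the directed-graph case.

```python
import numpy as np


def is_identifiable(G, v, tol=1e-9):
    """Return True if the link sequence v lies in the row space of G.

    G: (paths x links) 0/1 routing matrix (assumed model, see lead-in).
    v: 0/1 indicator vector of the links in the sequence.
    """
    G = np.asarray(G, dtype=float)
    v = np.asarray(v, dtype=float).reshape(1, -1)
    rank_g = np.linalg.matrix_rank(G, tol=tol)
    rank_aug = np.linalg.matrix_rank(np.vstack([G, v]), tol=tol)
    # Appending v does not raise the rank iff v is a combination of rows of G.
    return rank_aug == rank_g


def sequence_loss_rate(G, path_loss, v):
    """Infer the loss rate of an identifiable link sequence v.

    path_loss: measured end-to-end loss rate of each path (one per row of G).
    """
    if not is_identifiable(G, v):
        raise ValueError("link sequence is not identifiable from these paths")
    G = np.asarray(G, dtype=float)
    b = np.log(1.0 - np.asarray(path_loss, dtype=float))
    # Any solution of G x = b gives the same v . x when v is in the row space;
    # lstsq returns the minimum-norm solution.
    x, *_ = np.linalg.lstsq(G, b, rcond=None)
    return 1.0 - np.exp(np.asarray(v, dtype=float) @ x)


if __name__ == "__main__":
    # Hypothetical topology: path 0 uses links 0 and 1, path 1 uses links 0 and 2.
    G = [[1, 1, 0],
         [1, 0, 1]]
    link_loss = np.array([0.01, 0.02, 0.05])            # made-up ground truth
    path_loss = 1.0 - np.exp(np.asarray(G) @ np.log(1.0 - link_loss))

    print(is_identifiable(G, [1, 0, 0]))                # link 0 alone
    print(is_identifiable(G, [1, 1, 0]))                # links 0 and 1 together
    print(sequence_loss_rate(G, path_loss, [1, 1, 0]))
```

In this toy topology the shared link 0 cannot be separated from the links that follow it, so only the two-link sequences are identifiable. This is the situation in which reporting loss rates for link sequences, rather than for individual links, avoids introducing further statistical assumptions.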
