Internet Networking and Application Troubleshooting
Yan Chen, Yao Zhao
2009-01-01
Abstract: The modern Internet is a large-scale distributed system composed of many complex, interoperating sub-networks, which makes network troubleshooting a broad, important, and challenging problem. We study several diagnosis scenarios, covering modeling, monitoring, and diagnosis. Specifically, in this dissertation, we propose solutions to the following four concrete problems. First, in the monitoring component, we consider the monitor placement and path selection problem for active monitoring. Targeting ISP VPN networks, our work is unique in taking operational constraints into account. These constraints include the monitors' measurement ability (e.g., throughput) and the link bandwidth allowed for measurement. Given these real-world challenges, we design the VScope monitoring system with the following contributions. First, we design a greedy-assisted linear programming algorithm to select as few monitors as possible that can monitor the whole network under the operational constraints. Second, VScope takes a multi-round measurement approach, which gives a smooth tradeoff between measurement frequency and monitor deployment and management cost. We propose three algorithms to schedule the path measurements in different rounds while obeying the operational constraints. Finally, we design a continuous monitoring and diagnosis mechanism which, once faulty paths are discovered, selects a minimal set of extra paths to measure in order to identify the faulty links. Second, in the diagnosis aspect, we propose a Least-biased End-to-end Network Diagnosis (LEND) system for inferring link-level properties such as loss rate. Unlike other statistics-based inference approaches, LEND does not introduce any assumptions beyond those in the linear algebraic model.
We also found a surprising difference between undirected and directed graphs in link-level diagnosis and proposed corresponding solutions. We define a minimal identifiable link sequence (MILS) as a link sequence of minimal length whose properties can be uniquely identified from end-to-end measurements. We also design efficient algorithms to find all the MILSes and infer their loss rates for diagnosis. Our LEND system works for any network topology and for both directed and undirected properties, and incrementally adapts to network topology and property changes. Third, it is highly desirable for end users, with no special privileges, to identify and pinpoint faults inside the network that degrade the performance of their applications. However, existing tools infer link-level loss rates inaccurately and have coarse diagnosis granularity. We propose a suite of simple user-level loss rate diagnosis algorithms that employ only one or both ends of a target path. Basically, these algorithms probe the routers on the target path and infer the link-level loss rates from the responses. The approaches fall into two categories: (1) deployed only at the source and (2) deployed at both source and destination. For the former, we propose two fragmentation aided diagnosis (FAD) approaches, Algebraic FAD and Opportunistic FAD, which use IP fragmentation to enable accurate link-level loss rate inference. For the latter, we propose Striped Probe Analysis (SPA), which significantly improves the diagnosis granularity over that of the source-only approaches. Finally, diagnosing fault and performance problems of large distributed systems is an important and challenging problem. Previous research usually traces the requests and reconstructs the execution path, using either inaccurate black-box or intrusive white-box approaches.
In this dissertation, we propose a novel semantics-assisted gray-box diagnosis approach, Rake, which accurately reveals the execution path of each individual request from sniffed network traces. The core idea of Rake is to identify the polymorphic IDs in network messages and link the related messages together via the application semantics. To make Rake a universal tool for general applications, we design a simple Rake language that allows users to provide the necessary semantics and hence reuse the core Rake linking component. We analyze, test, and evaluate Rake on several popular distributed applications, such as a web search system, a distributed computing cluster, content provider networks, DNS, and chat systems. The results show that Rake can be applied widely in distributed applications and is helpful in performance debugging. (Abstract shortened by UMI.)
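
The idea of linking messages through shared IDs can be illustrated with a minimal sketch. This is not Rake's actual algorithm or the Rake language, which the abstract does not specify. It assumes the IDs have already been extracted from each sniffed message and normalized to a common form, and it groups messages that share an ID, directly or transitively, using a union-find structure. The message names, the ID strings, and the function `link_messages` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sniffed messages: (description, set of IDs found in the message).
# Assumption: IDs are already extracted and normalized; a real system would
# need application semantics to recognize the same ID in different forms.
MESSAGES = [
    ("client->frontend query", {"req-17"}),
    ("frontend->backend lookup", {"req-17", "job-a3"}),
    ("backend->frontend reply", {"job-a3"}),
    ("client->frontend query", {"req-18"}),
    ("frontend->backend lookup", {"req-18", "job-b9"}),
]


def link_messages(messages):
    """Group message indices that share an ID directly or transitively."""
    parent = list(range(len(messages)))

    def find(i):
        # Follow parent pointers to the group representative, compressing paths.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as representative for stable output.
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # ID -> index of the first message carrying it
    for idx, (_, ids) in enumerate(messages):
        for ident in ids:
            if ident in first_seen:
                union(idx, first_seen[ident])
            else:
                first_seen[ident] = idx

    groups = defaultdict(list)
    for idx in range(len(messages)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


if __name__ == "__main__":
    for group in link_messages(MESSAGES):
        print([MESSAGES[i][0] for i in group])
```

In this toy trace the first three messages form one request path because `req-17` and `job-a3` co-occur in the second message, while the last two messages form a separate path. Ordering messages within a group, for example by timestamp, would be a further step that this sketch leaves out.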