Abstract: The modern Internet is a large-scale distributed system composed of many complex, interoperating sub-networks. Network troubleshooting has become a broad, important, and challenging problem in the current Internet. We study the problems that arise in different diagnosis scenarios, covering modeling, monitoring, and diagnosis. Specifically, in this dissertation, we target and propose solutions to the following four concrete problems.

First, we consider the monitor placement and path selection problem for active monitoring in the monitoring component. Targeting ISP VPN networks, our work is unique in taking operational constraints into account. These constraints include the monitors' measurement ability (e.g., throughput) and the link bandwidth allowed for measurement. Given these real-world challenges, we design the VScope monitoring system with the following contributions. First, we design a greedy-assisted linear programming algorithm that selects as few monitors as possible while still monitoring the whole network under the operational constraints. Second, VScope takes a multi-round measurement approach that gives a smooth tradeoff between measurement frequency and monitor deployment/management cost. We propose three algorithms that schedule the path measurements across rounds while obeying the operational constraints. Finally, we design a continuous monitoring and diagnosis mechanism that, once faulty paths are discovered, selects the minimal set of extra paths to measure in order to identify the faulty links.

Second, in the diagnosis aspect, we propose a Least-biased End-to-end Network Diagnosis (LEND) system for inferring link-level properties such as loss rate. Unlike other statistics-based inference approaches, LEND introduces no assumptions beyond those of the linear algebraic model.
We also found a surprising difference between undirected and directed graphs in link-level diagnosis and propose corresponding solutions. We define a minimal identifiable link sequence (MILS) as a link sequence of minimal length whose properties can be uniquely identified from end-to-end measurements. We also design efficient algorithms to find all the MILSes and infer their loss rates for diagnosis. Our LEND system works for any network topology and for both directed and undirected properties, and it adapts incrementally to changes in network topology and properties.

Third, it is highly desirable and important for end users, with no special privileges, to identify and pinpoint faults inside the network that degrade the performance of their applications. However, existing tools infer link-level loss rates inaccurately and have coarse diagnosis granularity. We propose a suite of simple loss rate diagnosis algorithms that require control of only one or both ends of a target path. These algorithms probe the routers on the target path and infer the link-level loss rates from the responses. The suite of user-level diagnosis approaches falls into two categories: (1) deployed only at the source and (2) deployed at both source and destination. For the former, we propose two fragmentation-aided diagnosis (FAD) approaches, Algebraic FAD and Opportunistic FAD, which use IP fragmentation to enable accurate link-level loss rate inference. For the latter, we propose Striped Probe Analysis (SPA), which significantly improves the diagnosis granularity over that of the source-only approaches.

Finally, diagnosing fault and performance problems in large distributed systems is an important and challenging problem. Previous research usually traces requests and reconstructs their execution paths, using either inaccurate black-box or intrusive white-box approaches.
In this dissertation, we propose a novel semantics-assisted gray-box diagnosis approach, Rake, which accurately reveals the execution path of each individual request from sniffed network traces. The core idea of Rake is to identify the polymorphic IDs in network messages and link related messages together via the application semantics. To make Rake a universal tool for general applications, we design a simple Rake language that lets users provide the necessary semantics and hence reuse the core Rake linking component. We analyze, test, and evaluate Rake on several popular distributed applications, such as a web search system, a distributed computing cluster, content provider networks, DNS, and chat systems. The results show that Rake can be applied widely in distributed applications and is helpful in performance debugging. (Abstract shortened by UMI.)
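
The linear algebraic model behind link-level loss inference is not spelled out in this abstract. As a minimal sketch, assume the common formulation in which a path's transmission rate is the product of its links' transmission rates, so that taking logarithms gives a linear system b = G x, where G is a 0/1 path-by-link routing matrix and x_j = log(1 - loss_j). Under that assumption, a link sequence v is identifiable from end-to-end measurements exactly when v lies in the row space of G. The function name `sequence_loss_rate` and the inputs used here are hypothetical illustrations; this checks a single given link sequence and is not the dissertation's MILS-finding algorithm.

```python
import numpy as np

def sequence_loss_rate(G, path_loss, v, tol=1e-9):
    """Return the loss rate of link sequence v, or None if not identifiable.

    G         : (paths x links) 0/1 routing matrix (assumed known).
    path_loss : measured end-to-end loss rate of each path, each < 1.
    v         : 0/1 vector marking the links in the sequence of interest.
    """
    G = np.asarray(G, dtype=float)
    v = np.asarray(v, dtype=float)
    # Log-transform: b_i = log(1 - path_loss_i) = sum_j G_ij * x_j.
    b = np.log1p(-np.asarray(path_loss, dtype=float))
    # Try to write v as a combination of measured paths: G^T w = v.
    w, *_ = np.linalg.lstsq(G.T, v, rcond=None)
    if not np.allclose(G.T @ w, v, atol=tol):
        return None  # v is outside the row space of G: not identifiable
    # v . x = w . b, then undo the log transform.
    return 1.0 - np.exp(w @ b)
```

With only two measured paths that share a first link, that shared link alone is not identifiable, while each full path is. Adding a third path that covers the two remaining links makes every individual link identifiable in this toy setting.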

Human readable network troubleshooting based on anomaly detection and feature scoring

HURRA! Human readable router anomaly detection

Opprentice: Towards Practical And Automatic Anomaly Detection Through Machine Learning

Network Anomaly Detection and Localization

Interactive Learning for Network Anomaly Monitoring and Detection with Human Guidance in the Loop

Unsupervised Learning in Next-Generation Networks: Real-Time Performance Self-Diagnosis

Design and implementation for automated network troubleshooting using data mining

Internet Networking and Application Troubleshooting

Anomaly detection system for network transport with machine learning approach

A network anomaly detection algorithm based on semi-supervised learning and adaptive multiclass balancing

High-Performance Unsupervised Anomaly Detection for Cyber-Physical System Networks

Netography: Troubleshoot Your Network With Packet Behavior In Sdn

Does Feature Matter: Anomaly Detection in Sensor Networks

On Real-Time and Self-Taught Anomaly Detection in Optical Networks Using Hybrid Unsupervised/Supervised Learning.

Develop End-to-End Anomaly Detection System

Anomaly detection in wide area network mesh using two machine learning anomaly detection algorithms

Network-Wide Anomaly Detection Based On Router Connection Relationships

From Explanation to Action: An End-to-End Human-in-the-loop Framework for Anomaly Reasoning and Management

Borderline SMOTE Algorithm and Feature Selection-Based Network Anomalies Detection Strategy

Detecting Anomalies In Communication Packet Streams Based On Generative Adversarial Networks

Anomaly-based intrusion detection system through feature selection analysis and building hybrid efficient model