Abstract: An anomaly is defined as a state of the system that does not conform to normal
behavior. For example, the emission of neutrons in a nuclear reactor channel
above the specified threshold is an anomaly. Big data refers to data sets
that are \emph{high volume, streaming, heterogeneous, distributed} and often
\emph{sparse}. Big data is not uncommon these days. For example, as per
Internet Live Stats, the number of tweets posted per day has gone above 500
million. Due to the data explosion in data-laden domains, traditional anomaly
detection techniques developed for small data sets scale poorly on large-scale
data sets. Therefore, we take an alternative approach to tackle anomaly
detection in big data. Essentially, there are two ways to scale anomaly
detection in big data. The first is based on \emph{online} learning and the
second is based on \emph{distributed} learning. Our aim in the thesis is to
tackle big data problems while detecting anomalies efficiently. To that end, we
first address the \emph{streaming} issue of big data and propose
Passive-Aggressive GMEAN (PAGMEAN) algorithms. Although online learning
algorithms scale well over large numbers of data points and dimensions, they
cannot process data that is distributed across multiple locations, which is
quite common these days. Therefore, we propose an anomaly detection algorithm
that is inherently distributed, using ADMM. Finally, we present a case study on
anomaly detection in nuclear power plant data.
What problem does this paper attempt to address?
The main problem that this paper attempts to solve is **efficiently detecting outliers in big data**. Specifically, the paper focuses on how to detect anomalies in big data with the following characteristics:
1. **Sparsity**: The data contains a large number of zero or missing values.
2. **High Dimensionality**: The data has a large number of feature dimensions.
3. **Streaming Data**: Data is continuously generated in the form of a stream and requires real-time processing.
4. **Distributed**: Data is distributed across multiple nodes and requires distributed processing.
### Problem Background
With the advent of the big data era, the dramatic increase in the amount of data has brought new challenges. Traditional anomaly detection methods hit bottlenecks when dealing with large-scale, high-dimensional, sparse, and streaming data. For example:
- **Financial Field**: Billions of transactions occur every day, and only a very small number of them are fraudulent. How can abnormal transactions be detected in real time in this vast amount of data?
- **Medical Health**: Patients' physiological data is monitored in real time through various sensors, and any abnormal situation needs to be reported to the doctor immediately to prevent life-threatening situations.
- **Computer Networks and Data Centers**: Potential denial-of-service (DoS) attacks, unauthorized access, and other security threats need to be detected.
- **Factory Monitoring**: Critical facilities such as nuclear power plants and other power plants are monitored by wireless sensors, and abnormal equipment states need to be detected in a timely manner.
- **Video Surveillance**: Video data captured by CCTV cameras needs to be analyzed in real time to identify potential malicious activities.
- **Satellite Images**: Rare occurrences, such as water bodies and rare metals, need to be identified from satellite images.
### Research Objectives
The main research objectives of the paper are:
- **Efficiently Detect Anomalies in Big Data**: Design algorithms that can efficiently detect outliers in large-scale, high-dimensional, sparse, and streaming data.
- **Handle Big Data in Different Scenarios**:
  - **Scenario 1**: Handle anomaly detection in streaming data.
  - **Scenario 2**: Handle anomaly detection in streaming, sparse, high-dimensional data.
  - **Scenario 3**: Handle anomaly detection in sparse, high-dimensional, distributed data.
### Specific Contributions
To achieve the above objectives, the paper proposes the following algorithms:
1. **PAGMEAN Algorithm**:
   - An online-learning-based algorithm for anomaly detection in streaming data.
   - Improves the traditional Passive-Aggressive (PA) algorithm by using a modified hinge loss function to optimize the Gmean performance metric.
   - Verifies the effectiveness and competitiveness of the algorithm on multiple real-world and benchmark datasets.
2. **ASPGD Algorithm**:
   - Used for anomaly detection in streaming, sparse, high-dimensional data.
   - Uses a smoothed modified hinge loss function and combines it with the Nesterov-accelerated stochastic proximal gradient descent algorithm.
   - Deals with data sparsity through L1 regularization.
   - Experimental results show that this algorithm achieves encouraging results on multiple benchmark and real-world datasets.
3. **DSCIL and CILSD Algorithms**:
   - Used for anomaly detection in sparse, high-dimensional, distributed data.
   - DSCIL is based on the Distributed Alternating Direction Method of Multipliers (DADMM) framework and uses a cost-sensitive, smoothed, and strongly convex hinge loss function.
   - CILSD is based on a FISTA-like update rule and converges faster.
   - Experimental results, compared against existing techniques, show that both algorithms perform well on multiple measures, including Gmean, F-measure, speedup ratio, and training time.
4. **Unsupervised Anomaly Detection**:
   - Proposes an unsupervised anomaly detection method based on Support Vector Data Description (SVDD), which is suitable for real-world situations where data labels are unclear.
   - Verifies the effectiveness of the algorithm in practice, for example on the KDD Cup 2008 Anomaly Detection Challenge dataset.
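
To make the online-learning setting and the Gmean metric mentioned above more concrete, here is a minimal Python sketch. It is not the paper's PAGMEAN algorithm: it uses the standard Passive-Aggressive (PA-I) update with the plain hinge loss as a stand-in, and it evaluates predictions with Gmean, the geometric mean of sensitivity and specificity. The function names (`gmean`, `pa1_update`, `predict`, `run_stream`), the synthetic two-Gaussian stream, and the parameter values are all illustrative assumptions, not taken from the paper.

```python
import math
import random


def gmean(y_true, y_pred):
    """Geometric mean of sensitivity (recall on +1) and specificity (recall on -1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == -1)
    if pos == 0 or neg == 0:
        return 0.0  # undefined when a class is absent; report 0 by convention here
    return math.sqrt((tp / pos) * (tn / neg))


def predict(w, x):
    """Linear prediction: +1 (anomaly) or -1 (normal)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


def pa1_update(w, x, y, C=1.0):
    """One standard PA-I step with the plain hinge loss.

    Assumption: this is the textbook PA-I rule, not the modified
    hinge loss that the paper uses to target Gmean.
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return w  # passive: the margin is already satisfied
    tau = min(C, loss / sq_norm)  # aggressive: step size capped at C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


def run_stream(n=2000, anomaly_rate=0.05, seed=0):
    """Predict-then-update over a synthetic imbalanced stream (hypothetical data)."""
    rng = random.Random(seed)
    w = [0.0, 0.0, 0.0]  # two features plus a constant bias feature
    y_true, y_pred = [], []
    for _ in range(n):
        y = 1 if rng.random() < anomaly_rate else -1
        center = 2.0 if y == 1 else -2.0
        x = [rng.gauss(center, 1.0), rng.gauss(center, 1.0), 1.0]
        y_pred.append(predict(w, x))  # predict before seeing the label
        y_true.append(y)
        w = pa1_update(w, x, y)       # then learn from the revealed label
    return gmean(y_true, y_pred)


if __name__ == "__main__":
    print("Gmean on synthetic stream:", run_stream())
```

Each example is seen once and then discarded, which is what lets this style of learner scale to streams. Gmean is used instead of accuracy because, with rare anomalies, a model that always predicts "normal" gets high accuracy but a Gmean of zero.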
In summary, by proposing multiple algorithms, the paper systematically addresses anomaly detection in big data across different scenarios and supports practical applications.