Research on Dynamic Data Flow Anomaly Detection based on Machine Learning

Liyang Wang, Yu Cheng, Hao Gong, Jiacheng Hu, Xirui Tang, Iris Li
2024-09-23
Abstract: The sophistication and diversity of contemporary cyberattacks have rendered the use of proxies, gateways, firewalls, and encrypted tunnels as a standalone defensive strategy inadequate. Consequently, the proactive identification of data anomalies has emerged as a prominent area of research within the field of data security. The majority of extant studies concentrate on sample-balanced data, with the consequence that detection performance is suboptimal in the context of unbalanced data. In this study, an unsupervised learning method is employed to identify anomalies in dynamic data flows. Initially, multi-dimensional features are extracted from real-time data, and a clustering algorithm is utilised to analyse the patterns of the data. This enables the potential outliers to be automatically identified. By clustering similar data, the model is able to detect data behaviour that deviates significantly from normal traffic without the need for labelled data. The results of the experiments demonstrate that the proposed method exhibits high accuracy in the detection of anomalies across a range of scenarios. Notably, it demonstrates robust and adaptable performance, particularly in the context of unbalanced data.
Machine Learning, Artificial Intelligence, Cryptography and Security
What problem does this paper attempt to address?
The main problem this paper addresses is that existing network security defense mechanisms detect complex, evolving network attacks poorly, especially on unbalanced data sets. Specifically:

1. **Limitations of traditional defense mechanisms**: Traditional defense mechanisms such as proxies, gateways, firewalls, and encrypted tunnels are insufficient against increasingly complex and diverse network attacks (such as DDoS attacks, data theft, and malware). These methods rely on signature and rule databases of known threats and struggle to handle unknown threats in real-time data streams.
2. **Challenges of anomaly detection in dynamic data streams**: Network data streams are real-time, continuous, and voluminous, which makes it hard for traditional static security mechanisms to detect potential threats in a timely manner. In a highly unbalanced data environment in particular (that is, normal traffic accounts for the vast majority and abnormal behavior is rare), traditional anomaly detection models that assume balanced samples often produce a high false positive rate or false negative rate.
3. **Application of unsupervised learning**: To solve the above problems, this paper proposes a dynamic data stream anomaly detection model based on unsupervised learning. By extracting multi-dimensional features from real-time data and using the Density Peaks Clustering (DPC) algorithm for cluster analysis, this model can automatically identify potential abnormal behaviors without the need for labeled data.

### Specific problem description

- **Processing of unbalanced data sets**: Most existing research focuses on sample-balanced data, but in real network environments data is usually highly unbalanced, with far more normal traffic than abnormal traffic. This imbalance leads to poor performance of traditional methods in detecting anomalies.
- **Processing of real-time and high-dimensional data**: Dynamic data streams are real-time and high-dimensional. Extracting features and detecting anomalies efficiently under these conditions is an important challenge.

### Solution

The method proposed in this paper addresses the above problems through the following steps:

1. **Feature extraction**: Extract multi-dimensional features from real-time data streams, including statistical features in the time domain and frequency domain, and reduce their dimensionality with Principal Component Analysis (PCA) to improve the efficiency of the subsequent clustering algorithm.
2. **Unsupervised clustering algorithm**: Use the Density Peaks Clustering (DPC) algorithm to cluster the multi-dimensional features and automatically identify outliers. The algorithm calculates each data point's local density and its minimum distance to any point of higher density, then selects points that have high density and lie far from any denser point as cluster centers.
3. **Anomaly scoring**: Define an anomaly scoring formula. Data points whose scores exceed a set threshold are marked as abnormal, so abnormal behaviors are identified automatically without pre-labeling.

### Experimental verification

The paper verifies the effectiveness of this method through experiments on two network traffic data sets, NSL-KDD and UNSW-NB15, and compares it with classic methods such as K-Means, Isolation Forest, and DBSCAN. The experimental results show that this method performs well in terms of detection accuracy, G-Mean, and false positive rate, and is more robust and stable, especially when dealing with unbalanced data.

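
The summary does not define G-Mean or the false positive rate, so the sketch below assumes their usual definitions: G-Mean as the geometric mean of the true positive rate and the true negative rate, and the false positive rate as FP / (FP + TN). The function names and the toy labels are illustrative only and are not taken from the paper. The example shows why these metrics matter on unbalanced data: a detector that labels everything as normal still reaches 95% accuracy here, yet its G-Mean is zero.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN, where 1 marks an anomaly and 0 marks normal traffic."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Fraction of samples labelled correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def g_mean(y_true, y_pred):
    """Geometric mean of true positive rate and true negative rate (assumed definition)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)


def false_positive_rate(y_true, y_pred):
    """Share of normal samples that were wrongly flagged as anomalies."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return fp / (fp + tn) if fp + tn else 0.0


# Toy unbalanced labels: 95 normal samples followed by 5 anomalies.
y_true = [0] * 95 + [1] * 5

# Detector A never raises an alarm; detector B finds 4 of 5 anomalies with 5 false alarms.
pred_a = [0] * 100
pred_b = [1] * 5 + [0] * 90 + [1] * 4 + [0] * 1

for name, pred in (("A", pred_a), ("B", pred_b)):
    print(name, accuracy(y_true, pred), g_mean(y_true, pred), false_positive_rate(y_true, pred))
```

Detector B has lower accuracy than detector A but a far higher G-Mean, because G-Mean drops to zero whenever either class is missed entirely.
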
In summary, this paper aims to address the shortcomings of existing network security defense mechanisms on unbalanced data sets by introducing unsupervised learning together with feature extraction and clustering algorithms, and to improve the accuracy and reliability of anomaly detection in dynamic data streams.
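
The PCA, Density Peaks Clustering, and anomaly-scoring steps described under Solution can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the paper's implementation: features are taken as already extracted, the density uses the standard DPC cutoff kernel, and the scoring rule `delta / (rho + 1)`, the `cutoff` value, the mean-plus-three-standard-deviations threshold, and the synthetic data are all hypothetical choices, since the summary does not give the paper's actual formula or parameters.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project features onto their top principal components (PCA via SVD)."""
    centered = features - features.mean(axis=0)
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    return centered @ components[:n_components].T


def density_peaks(points, cutoff):
    """Return DPC local density (rho) and distance to the nearest denser point (delta)."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Cutoff kernel: count neighbours closer than `cutoff`, excluding the point itself.
    rho = (dist < cutoff).sum(axis=1) - 1
    delta = np.empty(len(points))
    for i in range(len(points)):
        denser = rho > rho[i]
        # A point with no denser neighbour gets its largest distance, by DPC convention.
        delta[i] = dist[i, denser].min() if denser.any() else dist[i].max()
    return rho, delta


def anomaly_scores(rho, delta):
    """Hypothetical score: high for sparse points that lie far from denser regions."""
    return delta / (rho + 1.0)


# Synthetic, unbalanced stand-in for traffic features: 200 normal rows, 3 anomalies.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 5))
anomalies = np.array([
    [15.0, 15.0, 15.0, 15.0, 15.0],
    [-15.0, 15.0, -15.0, 15.0, -15.0],
    [20.0, -20.0, 20.0, -20.0, 20.0],
])
features = np.vstack([normal, anomalies])

reduced = pca_reduce(features, n_components=2)
rho, delta = density_peaks(reduced, cutoff=1.0)
scores = anomaly_scores(rho, delta)

# Hypothetical threshold rule: flag scores more than three standard deviations above the mean.
threshold = scores.mean() + 3.0 * scores.std()
flagged = np.flatnonzero(scores > threshold)
print(flagged)
```

Dividing by the density keeps cluster centers, which also have a large `delta`, from being flagged: only points that are both isolated and sparse score highly. The pairwise distance matrix costs quadratic memory, so a streaming setting would need windowing or an approximate neighbour search.
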