Abstract: The sophistication and diversity of contemporary cyberattacks have rendered proxies, gateways, firewalls, and encrypted tunnels inadequate as a standalone defensive strategy. Consequently, the proactive identification of data anomalies has emerged as a prominent area of research within the field of data security. Most existing studies concentrate on sample-balanced data, so their detection performance is suboptimal on unbalanced data. In this study, an unsupervised learning method is employed to identify anomalies in dynamic data flows. First, multi-dimensional features are extracted from real-time data, and a clustering algorithm is used to analyse the patterns in the data, so that potential outliers can be identified automatically. By clustering similar data, the model is able to detect data behaviour that deviates significantly from normal traffic without the need for labelled data. The experimental results demonstrate that the proposed method achieves high accuracy in detecting anomalies across a range of scenarios. Notably, it remains robust and adaptable on unbalanced data.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is the unsatisfactory detection performance of existing network security defense mechanisms against complex and changing network attacks, especially on unbalanced data sets. Specifically:

1. **Limitations of traditional defense mechanisms**: Traditional defense mechanisms such as proxies, gateways, firewalls, and encrypted tunnels are insufficient against increasingly complex and diverse network attacks (such as DDoS attacks, data theft, and malware). These methods rely on signature and rule databases of known threats and struggle to handle unknown threats in real-time data streams.
2. **Challenges of anomaly detection in dynamic data streams**: In a network environment, data streams are real-time, continuous, and very large, which makes it difficult for traditional static security mechanisms to detect potential threats in a timely manner. In a highly unbalanced data environment (that is, normal traffic accounts for the vast majority and abnormal behavior is rare), traditional anomaly detection models that assume balanced samples often produce a high false positive rate or false negative rate.
3. **Application of unsupervised learning**: To solve the above problems, the paper proposes a dynamic data stream anomaly detection model based on unsupervised learning. By extracting multi-dimensional features from real-time data and applying the Density Peaks Clustering (DPC) algorithm for cluster analysis, the model can automatically identify potential abnormal behaviors without the need for labeled data.

### Specific problem description

- **Processing of unbalanced data sets**: Most existing research focuses on sample-balanced data, but in a real network environment data is usually highly unbalanced, with normal traffic far outnumbering abnormal traffic. This imbalance leads to poor performance of traditional methods in detecting anomalies.
- **Processing of real-time and high-dimensional data**: Dynamic data streams are real-time and high-dimensional. Extracting features and performing anomaly detection efficiently is an important challenge.

### Solution

The method proposed in the paper addresses the above problems through the following steps:

1. **Feature extraction**: Extract multi-dimensional features from real-time data streams, including statistical features in the time domain and frequency domain, and reduce their dimensionality with Principal Component Analysis (PCA) to improve the efficiency of the subsequent clustering algorithm.
2. **Unsupervised clustering algorithm**: Use the Density Peaks Clustering (DPC) algorithm to cluster the multi-dimensional features and automatically identify outliers. For each data point, compute its local density and its minimum distance to any point of higher density, then select points that have high density and lie far from any denser point as cluster centers.
3. **Anomaly scoring**: Define an anomaly scoring formula. Data points whose scores exceed a set threshold are marked as abnormal, which allows abnormal behaviors to be identified automatically without pre-labeling.

### Experimental verification

The paper verifies the effectiveness of the method experimentally on two network traffic data sets, NSL-KDD and UNSW-NB15, and compares it with classic methods such as K-Means, Isolation Forest, and DBSCAN. The experimental results show that the method performs well in terms of detection accuracy, G-Mean, and false positive rate, and is more robust and stable than the baselines, especially on unbalanced data.
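The evaluation metrics named above can be illustrated with a minimal sketch. The summary does not define G-Mean, so the code assumes the common definition, the geometric mean of the true positive rate and the true negative rate. The function names and the synthetic labels (1 for anomaly, 0 for normal) are hypothetical and are not taken from the paper.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN, treating label 1 as 'anomaly' and 0 as 'normal'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)

def false_positive_rate(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return fp / (fp + tn) if (fp + tn) else 0.0

def g_mean(y_true, y_pred):
    """Assumed definition: sqrt(TPR * TNR)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)

# Synthetic, unbalanced labels: 95 normal samples followed by 5 anomalies.
y_true = [0] * 95 + [1] * 5
# A trivial detector that never raises an alarm.
always_normal = [0] * 100
# A detector with 5 false alarms that catches 4 of the 5 anomalies.
detector = [0] * 90 + [1] * 5 + [1] * 4 + [0]

for name, y_pred in [("always_normal", always_normal), ("detector", detector)]:
    print(name, accuracy(y_true, y_pred), g_mean(y_true, y_pred),
          false_positive_rate(y_true, y_pred))
```

On this toy data the detector that never raises an alarm reaches an accuracy of 0.95 but a G-Mean of 0, which is why G-Mean is a more informative metric than accuracy when normal traffic dominates.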
In summary, the paper aims to address the shortcomings of existing network security defense mechanisms on unbalanced data sets by combining unsupervised learning with feature extraction and density-based clustering, thereby improving the accuracy and reliability of anomaly detection in dynamic data streams.
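The clustering and scoring steps described in the solution can be sketched as follows. This is a minimal illustration of the two standard DPC quantities, local density and distance to the nearest denser point, not the paper's implementation. The summary does not state the anomaly scoring formula, so the score `delta / (rho + 1)`, the cutoff distance `d_c`, the threshold, and all function names are assumptions made for the example. The data is a synthetic 2-D grid plus one distant point; feature extraction and PCA are omitted.

```python
import math

def dpc_quantities(points, d_c):
    """Return (rho, delta) for Density Peaks Clustering.

    rho[i]   : number of other points closer than the cutoff distance d_c.
    delta[i] : distance to the nearest point with strictly higher rho; for a
               point with no denser point, the largest distance to any point.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

def anomaly_scores(rho, delta):
    """Hypothetical score (an assumption, not the paper's formula):
    large when a point is isolated (high delta) and sparse (low rho)."""
    return [d / (r + 1) for r, d in zip(rho, delta)]

def flag_anomalies(scores, threshold):
    """Indices of points whose score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]

# Synthetic data: a dense 5x5 grid of "normal" points and one distant point.
points = [(0.1 * x, 0.1 * y) for x in range(5) for y in range(5)]
points.append((10.0, 10.0))

rho, delta = dpc_quantities(points, d_c=0.25)
scores = anomaly_scores(rho, delta)

# Cluster center: the point with the largest rho * delta (dense and far
# from any denser point).
center = max(range(len(points)), key=lambda i: rho[i] * delta[i])
print(center, flag_anomalies(scores, threshold=1.0))  # prints: 12 [25]
```

The grid's middle point has both high density and high distance, so it is picked as the cluster center, while the distant point has high distance but zero density and is the only one flagged. The pairwise distance matrix costs O(n²) time and memory, so a streaming deployment would need windowing or an approximate neighbor search.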

Research on Dynamic Data Flow Anomaly Detection based on Machine Learning

Dynamic Micro-cluster-Based Streaming Data Clustering Method for Anomaly Detection

Anomaly Intrusion Detection Based on Data Stream

Fast Anomaly Identification Based on Multi-Aspect Data Streams for Intelligent Intrusion Detection Toward Secure Industry 4.0

Anomaly Detection of Streaming Data Based on Deep Learning

High-speed anomaly traffic detection based on staged frequency domain features

Deep Learning for Malicious Flow Detection

Dynamic log file analysis: An unsupervised cluster evolution approach for anomaly detection

Research of Anomaly detection based on Dynamic Anomaly Detection Enhancement Framework

A Human-in-the-Loop Anomaly Detection Architecture for Big Traffic Data of Cellular Network

A data-driven approach for intrusion and anomaly detection using automated machine learning for the Internet of Things

An Unsupervised Learning-Based Multivariate Anomaly Detection Method for Dynamic Attention Graphs

A Hybrid Deep Learning Anomaly Detection Framework for Intrusion Detection

Unsupervised Abnormal Traffic Detection through Topological Flow Analysis

An Effective Method for Computer Network Anomaly Detection

FlowGANAnomaly: Flow-Based Anomaly Network Intrusion Detection with Adversarial Learning

DPDGAD: A Dual-Process Dynamic Graph-based Anomaly Detection for multivariate time series analysis in cyber-physical systems

Fast Wireless Sensor Anomaly Detection based on Data Stream in Edge Computing Enabled Smart Greenhouse

Flow Graph Anomaly Detection Based on Unsupervised Learning

Simulation Research on Anomaly Detection in Large Data Environment

Flow Interaction Graph Analysis: Unknown Encrypted Malicious Traffic Detection