Research on Malicious URL Detection Based on Cost-sensitive Learning
CAI Qingmeng, WANG Jian, LI Pengbo
DOI: https://doi.org/10.19363/J.cnki.cn10-1380/tn.2023.03.05
2023-01-01
Journal of Cyber Security
Abstract: With the advent of the big data era, malicious URLs, as a medium for Web attacks, increasingly threaten the security of users’ information. Traditional malicious URL detection methods, such as blacklist detection and signature matching, are exposing their intrinsic defects. To address this, this paper proposes a malicious URL detection model based on a cost-sensitive learning strategy. HTTP request parameters together with URL information are used as the original data samples for feature extraction, and the corresponding data processing is carried out to resolve the difficulty of extracting features from simple URL data. In addition, three encoding methods are compared experimentally and the best-performing character encoding is chosen, which ensures the effectiveness of the subsequent detection model. For the neural network, a Convolutional Neural Network suited to URL detection is specially designed for the characteristics of URL character input; it uses two convolutional layers to extract deep features of the data. A Bidirectional Long Short-Term Memory network then extracts temporal features from the output of the pooling layer, and only the last unit of this network outputs the temporal features, which achieves a pooling effect. This approach not only effectively extracts the contextual information of the data but also avoids a large amount of model computation, thus ensuring detection efficiency. At the same time, to address the problem of imbalanced data samples, the model assigns different penalty factors to data samples during the iterative process, improves the rules for assigning and normalizing the initial sample weights, and increases the weight of malicious samples in the overall error function.
Experimental results show that this model outperforms other mainstream detection models in accuracy, recall and detection efficiency, and is more robust to imbalanced data sets.
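The cost-sensitive idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual weighting rule, so everything below is an assumption made for illustration: initial sample weights are taken as inversely proportional to class size, scaled by a hypothetical per-class cost, and normalized to sum to 1, then used in a weighted binary cross-entropy. The function names `initial_weights` and `weighted_bce` and the parameters `malicious_cost` and `benign_cost` are hypothetical.

```python
import math


def initial_weights(labels, malicious_cost=1.0, benign_cost=1.0):
    """Illustrative rule (an assumption, not the paper's formula):
    weight each sample by cost / class size, then normalize to sum to 1.
    Label 1 marks a malicious sample, label 0 a benign one."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    raw = [malicious_cost / n_pos if y == 1 else benign_cost / n_neg
           for y in labels]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_bce(labels, probs, weights, eps=1e-12):
    """Binary cross-entropy in which each sample's error term is
    multiplied by its weight, so rare malicious samples count more."""
    loss = 0.0
    for y, p, w in zip(labels, probs, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss -= w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return loss


# One malicious sample among four benign ones.
labels = [1, 0, 0, 0, 0]
weights = initial_weights(labels)
loss = weighted_bce(labels, [0.5] * 5, weights)
```

With equal class costs, the single malicious sample here carries as much total weight as the four benign samples combined; raising `malicious_cost` would shift the error function further toward malicious samples.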