Detecting DGA-based botnets through effective phonics-based features

Dan Zhao, Hao Li, Xiuwen Sun, Yazhe Tang
DOI: https://doi.org/10.1016/j.future.2023.01.027
IF: 7.307
2023-06-01
Future Generation Computer Systems
Abstract: Botnets are networks of compromised machines that cybercriminals increasingly use to perform various attacks. Traditional defenses such as blocklisting become ineffective because domain generation algorithms (DGAs) periodically and rapidly generate illegitimate domain names to maintain contact with command and control (C&C) servers. Deep learning and machine learning are candidate solutions to the problem. Deep learning methods achieve high accuracy but take more time. Machine learning methods train quickly, which suits the frequent retraining needed to maintain high accuracy. However, existing machine learning solutions cannot precisely capture the linguistic characteristics of domain names, which causes many false positives. To understand domain name strings more comprehensively, we present the DOmain Linguistic PHonIcs detectioN (DOLPHIN) method, a novel method for detecting DGA-based botnets. Considering the detection context and the correspondence between the pronunciations and spellings of words, we design DOLPHIN patterns, which classify variable-length vowels and consonants following the principles of phonics. Based on DOLPHIN patterns, a novel matching automaton reconstructs domain names from variable-length vowel and consonant components. From the reconstructed domain names, DOLPHIN extracts phonics-based features. We implement DOLPHIN with supervised learning methods and compare it to the leading methods FANCI, HAGDetector, and LSTM.MI. The experimental results show that, compared to FANCI with random forests, DOLPHIN improves detection accuracy by 0.0265, with lower FPR and FNR and without much additional overhead. DOLPHIN also generalizes to other real-world data sources, with the FPR decreasing by 0.0801 (62.97%) compared with FANCI. DOLPHIN can be combined with most linguistic features and improves performance over existing linguistic feature-based methods.
computer science, theory & methods
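The abstract describes splitting a domain name into variable-length vowel and consonant components and deriving features from them. The following is a minimal sketch of that general idea, not the authors' implementation. The pattern sets, the greedy longest-match strategy, the function names `segment` and `phonics_features`, and the feature names are all assumptions made for illustration; the actual DOLPHIN patterns, matching automaton, and feature set are defined in the paper.

```python
# Illustrative pattern sets only: these are NOT the DOLPHIN patterns from the paper.
VOWEL_PATTERNS = {"a", "e", "i", "o", "u", "ai", "ea", "ee", "oa", "oo", "ou", "igh"}
CONSONANT_PATTERNS = {
    "b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "q", "r",
    "s", "t", "v", "w", "x", "y", "z", "ch", "ck", "ng", "ph", "sh", "th", "tch",
}
MAX_LEN = max(len(p) for p in VOWEL_PATTERNS | CONSONANT_PATTERNS)


def segment(label):
    """Split a domain label into (component, kind) pairs by greedy longest match.

    kind is "V" for a vowel pattern, "C" for a consonant pattern, and "X" for
    any character that matches no pattern (digits, hyphens, and so on).
    """
    label = label.lower()
    parts = []
    i = 0
    while i < len(label):
        # Try the longest candidate first, then shrink.
        for size in range(min(MAX_LEN, len(label) - i), 0, -1):
            chunk = label[i:i + size]
            if chunk in VOWEL_PATTERNS:
                kind = "V"
            elif chunk in CONSONANT_PATTERNS:
                kind = "C"
            else:
                continue
            parts.append((chunk, kind))
            i += size
            break
        else:
            # No pattern matched at this position.
            parts.append((label[i], "X"))
            i += 1
    return parts


def phonics_features(label):
    """Compute a few hypothetical features from the segmented label."""
    parts = segment(label)
    n = len(parts)
    if n == 0:
        return {"components": 0, "vowel_ratio": 0.0, "other_ratio": 0.0,
                "mean_component_len": 0.0, "max_consonant_run": 0}
    vowels = sum(1 for _, kind in parts if kind == "V")
    other = sum(1 for _, kind in parts if kind == "X")
    # Longest run of consecutive consonant components.
    longest = run = 0
    for _, kind in parts:
        run = run + 1 if kind == "C" else 0
        longest = max(longest, run)
    return {
        "components": n,
        "vowel_ratio": vowels / n,
        "other_ratio": other / n,
        "mean_component_len": len(label) / n,
        "max_consonant_run": longest,
    }


if __name__ == "__main__":
    for name in ("google", "xkqzjw9"):
        print(name, segment(name), phonics_features(name))
```

Under these assumed patterns, a pronounceable label tends to alternate vowel and consonant components, while a random-looking label tends to produce long consonant runs and few vowel components. Feature vectors of this kind could then be fed to a supervised classifier such as a random forest.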