Far from Classification Algorithm: Dive into the Preprocessing Stage in DGA Detection.

Mingkai Tong, Guo Li, Runzi Zhang, Jianxin Xue, Wenmao Liu, Jiahai Yang
DOI: https://doi.org/10.1109/trustcom50675.2020.00070
2020-01-01
Abstract: Attackers have used the domain-flux technique for many years to maintain botnets, and its core is the adoption of a domain generation algorithm (DGA). To combat these attackers, many recent works have addressed DGA domain detection. However, they usually collect quite limited data and conduct experiments on a closed dataset, so the DGA data and benign data they collect cannot represent the real distribution of the two classes well. Moreover, they handle the domains coarsely and train the classifier directly on the original data, which is inadequate for separating the two types of domains and leads to many false positives and false negatives in real-world deployment. In this paper, we conduct the first large-scale analysis of DGA domains at the traffic level and argue that the preprocessing stage, which existing works usually ignore, is also vital for the final classifier. We collect more DGA domain data than prior works, together with DNS logs provided by a large company whose DNS data covers most of the important industries in China. Based on this data, we analyze the distribution of DGA domains in traffic and give quantitative results showing that NXDomain (non-existent domain) traffic is more suitable for DGA detection. We also give detailed preprocessing steps for handling the original domains. Our experiments show that, with this preprocessing stage, the classifier performs better on the DGA detection task. Our research indicates that improving the classification algorithm alone is far from enough for DGA detection, and that the preprocessing stage is also a key component in bringing DGA detection methods from the lab to product.