Reliable Malware Analysis and Detection using Topology Data Analysis

Lionel Nganyewou Tidjon, Foutse Khomh
DOI: https://doi.org/10.48550/arXiv.2211.01535
2022-11-09
Abstract: Malware is becoming increasingly complex and is spreading across networks, targeting different infrastructures and personal end devices to collect, modify, and destroy victim information. Malware behaviors are polymorphic, metamorphic, and persistent; malware can hide to bypass detectors, adapt to new environments, and even leverage machine learning techniques to better damage targets. This makes it difficult to analyze and detect with traditional endpoint detection and response and intrusion detection and prevention systems. To defend against malware, recent work has proposed different techniques based on signatures and machine learning. In this paper, we propose to use an algebraic topological approach called topological data analysis (TDA) to efficiently analyze and detect complex malware patterns. Next, we compare different TDA techniques (i.e., persistent homology, ToMATo, TDA Mapper) and existing techniques (i.e., PCA, UMAP, t-SNE) using different classifiers, including random forest, decision tree, XGBoost, and LightGBM. We also propose some recommendations to deploy the best-identified models for malware detection at scale. Results show that TDA Mapper (combined with PCA) is better than PCA for clustering and for identifying hidden relationships between malware clusters. Persistence diagrams are better than UMAP and t-SNE at identifying overlapping malware clusters, with low execution time. For malware detection, malware analysts can use Random Forest and Decision Tree with t-SNE and persistence diagrams to achieve better performance and robustness on noisy data.
Cryptography and Security, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the growing complexity of malware, which spreads across networks and targets different infrastructures and personal end devices to collect, modify, and destroy victims' information. Malware behaviors are polymorphic, metamorphic, and persistent: malware can hide to bypass detectors, adapt to new environments, and even use machine learning techniques to better damage targets. As a result, traditional endpoint detection and response and intrusion detection and prevention systems struggle to analyze and detect these complex malware patterns. Existing work has proposed techniques based on signatures and machine learning, but these methods still fall short against polymorphic and metamorphic malware and zero-day attacks. The paper therefore proposes an approach based on algebraic topology, topological data analysis (TDA), to analyze and detect complex malware patterns more effectively. Specifically, it compares TDA techniques (such as persistent homology, TDA Mapper, and ToMATo) with existing techniques (such as PCA, UMAP, and t-SNE) across several classifiers (Random Forest, Decision Tree, XGBoost, LightGBM), and offers recommendations for deploying the best-identified models at scale. The main goal is to show how TDA techniques can help analyze complex relationships between malware samples and improve the detection of new samples, especially on noisy data. With this approach, researchers and malware analysts can improve their analysis and detection methods and respond more effectively to increasingly complex cyber threats.
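To make the idea of a persistence diagram more concrete, the following is a minimal sketch of 0-dimensional persistent homology (tracking connected components) on a toy point cloud. It is not the paper's pipeline: the function name `h0_persistence` and the sample points are hypothetical, and a real analysis would more likely rely on a dedicated TDA library. In a Vietoris-Rips filtration every point is born as its own component at scale 0, and a component dies when an edge first merges it into another one, so the death scales can be read off a Kruskal-style union-find pass over the sorted pairwise distances.

```python
import itertools
import math


def h0_persistence(points):
    """Return the death scales of 0-dimensional features (connected components)
    in the Vietoris-Rips filtration of `points`.

    Every component is born at scale 0. The single component that never
    dies is omitted from the result.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # This edge merges two components: one of them dies at this scale.
            parent[root_i] = root_j
            deaths.append(length)
    return deaths


# Hypothetical 2-D feature vectors: two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
deaths = h0_persistence(points)
print(deaths)
```

On this toy input, three components die at scale 1.0 and one survives until roughly 13.45, the distance between the two groups. A large gap between short-lived and long-lived features like this is the kind of signal a persistence diagram exposes about cluster structure.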