Unmasking the Shadows: Pinpoint the Implementations of Anti-Dynamic Analysis Techniques in Malware Using LLM

Haizhou Wang, Nanqing Luo, Peng Liu
2024-11-09
Abstract: Sandboxes and other dynamic analysis processes are prevalent in malware detection systems nowadays to enhance the capability of detecting 0-day malware. Therefore, techniques of anti-dynamic analysis (TADA) are prevalent in modern malware samples, and sandboxes can suffer from false negatives and analysis failures when analyzing the samples with TADAs. In such cases, human reverse engineers will get involved in conducting dynamic analysis manually (i.e., debugging, patching), which in turn also gets obstructed by TADAs. In this work, we propose a Large Language Model (LLM) based workflow that can pinpoint the location of the TADA implementation in the code, to help reverse engineers place breakpoints used in debugging. Our evaluation shows that we successfully identified the locations of 87.80% known TADA implementations adopted from public repositories. In addition, we successfully pinpoint the locations of TADAs in 4 well-known malware samples that are documented in online malware analysis blogs.
Cryptography and Security, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the problem of locating where techniques of anti-dynamic analysis (TADA) are implemented in malware code, so that reverse engineers know where to place breakpoints when debugging. Specifically, its goals are:

1. **Reduce the time and labor costs of reverse engineers**: By automatically identifying TADA implementation locations and providing a set of breakpoint suggestions, the method lets reverse engineers bypass these techniques more efficiently and proceed to root-cause analysis.
2. **Improve the efficiency of malware analysis**: When automated dynamic analysis fails, reverse engineers must analyze the malware manually, and TADAs make that process more complex and time-consuming. The method aims to reduce this workload through automation.

### Background and Motivation of the Paper

#### Background

- **Dynamic analysis and sandbox detection**: Modern malware detection systems widely use sandboxes and dynamic analysis to detect 0-day malware. However, many malware samples implement TADAs, which cause false negatives or analysis failures in sandbox detection.
- **Diversity of TADA**: A TADA can check many aspects of the environment, such as hardware, running processes, the file system, and user traces, and different malware may implement the same check in different ways. This makes it difficult for rule-based static analysis to cover all TADAs.

#### Motivation

- **Challenges in reverse engineering**: Manual debugging and patching after a sandbox failure are themselves obstructed by TADAs, so reverse engineers need a way to quickly locate where a TADA is implemented.
- **Application potential of LLM**: Large language models (LLMs) are strong at understanding natural language and can be used to interpret the strings and other features that anti-dynamic analysis techniques rely on, thereby assisting reverse engineering.

### Overview of the Method

#### Workflow

1. **Program analysis**: Extract basic blocks (BBs) and their related features, including assembly features, API call features, and string features.
2. **Feature construction**: Convert the extracted features into natural-language descriptions and construct prompts.
3. **LLM query**: Send the prompts to the LLM and obtain, for each basic block, a score indicating whether it belongs to a TADA implementation.
4. **Breakpoint suggestion**: Based on the LLM scores, select breakpoint locations to help reverse engineers debug the sample.

#### Feature extraction

- **Assembly features**: Extract instruction mnemonics and memory access features from the assembly code. For example, the `pushf` and `popf` instructions may indicate debugger detection, and the `cpuid` instruction may indicate virtual machine detection.
- **API call features**: Extract API calls, especially those related to anti-dynamic analysis, such as `IsDebuggerPresent`.
- **String features**: Extract strings, especially those that may be used to detect sandbox environments, such as usernames and file names.

### Experimental Results

- **Accuracy**: The proposed workflow identified the locations of 87.80% of the known TADA implementations collected from public repositories.
- **Practical application**: The workflow located the TADA implementations in 4 well-known malware samples that are documented in online malware analysis blogs.
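
The four workflow steps described under "Overview of the Method" can be illustrated with a small sketch. This is not the authors' implementation: the `BasicBlock` structure, the prompt wording, the 0.5 threshold, and every function name are assumptions made for illustration, and `toy_score` is a keyword counter standing in for the LLM query so that the example runs without a model. The only indicators used are the ones named in the summary (`pushf`, `popf`, `cpuid`, `IsDebuggerPresent`).

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative indicators only: the mnemonics named in the summary above.
INTERESTING_MNEMONICS = {"pushf", "popf", "cpuid"}


@dataclass
class BasicBlock:
    """Hypothetical container for one basic block and its raw features."""
    address: int
    instructions: list  # assembly lines, e.g. "cpuid" or "mov eax, 1"
    api_calls: list = field(default_factory=list)
    strings: list = field(default_factory=list)


def extract_features(bb: BasicBlock) -> dict:
    """Step 1: collect assembly, API call, and string features for a block."""
    mnemonics = {ins.split()[0].lower() for ins in bb.instructions if ins.strip()}
    return {
        "mnemonics": sorted(mnemonics & INTERESTING_MNEMONICS),
        "apis": list(bb.api_calls),
        "strings": list(bb.strings),
    }


def build_prompt(bb: BasicBlock, feats: dict) -> str:
    """Step 2: turn the features into a natural-language prompt (assumed wording)."""
    parts = [f"Basic block at {bb.address:#x}."]
    if feats["mnemonics"]:
        parts.append("It executes: " + ", ".join(feats["mnemonics"]) + ".")
    if feats["apis"]:
        parts.append("It calls: " + ", ".join(feats["apis"]) + ".")
    if feats["strings"]:
        quoted = ", ".join(repr(s) for s in feats["strings"])
        parts.append("It references strings: " + quoted + ".")
    parts.append("From 0 to 1, how likely is this block to be an "
                 "anti-dynamic-analysis check?")
    return " ".join(parts)


def suggest_breakpoints(blocks, score_fn: Callable[[str], float],
                        threshold: float = 0.5):
    """Steps 3 and 4: score each block, keep those at or above the threshold,
    and return (address, score) pairs with the highest score first."""
    scored = [(bb.address, score_fn(build_prompt(bb, extract_features(bb))))
              for bb in blocks]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


def toy_score(prompt: str) -> float:
    """Stand-in for the LLM query: a keyword count, NOT a language model."""
    keywords = ("cpuid", "pushf", "popf", "IsDebuggerPresent")
    return min(1.0, sum(k in prompt for k in keywords) / 2)


blocks = [
    BasicBlock(0x401000, ["push ebp", "mov ebp, esp"]),
    BasicBlock(0x401020, ["call eax", "test eax, eax"],
               api_calls=["IsDebuggerPresent"]),
    BasicBlock(0x401040, ["pushf", "pop eax", "cpuid"]),
]
breakpoints = suggest_breakpoints(blocks, toy_score)
for address, score in breakpoints:
    print(f"suggest breakpoint at {address:#x} (score {score:.2f})")
```

Passing the scorer in as a callable keeps the sketch independent of any particular model: a real setup would replace `toy_score` with a function that sends the prompt to an LLM and parses a numeric score from the reply, and the threshold would need tuning against labeled samples.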
### Conclusion

The paper proposes an LLM-based workflow that automatically locates the implementations of anti-dynamic analysis techniques (TADA) in malware. In the evaluation it identified 87.80% of known TADA implementations, and its breakpoint suggestions are intended to reduce the time and labor costs of reverse engineers and improve the efficiency of malware analysis.