Automatic State Machine Inference for Binary Protocol Reverse Engineering

Junhai Yang, Fenghua Li, Yixuan Zhang, Junhao Zhang, Liang Fang, Yunchuan Guo
2024-12-04
Abstract: Protocol Reverse Engineering (PRE) is used to analyze protocols by inferring their structure and behavior. However, current PRE methods mainly focus on field identification within a single protocol and neglect Protocol State Machine (PSM) analysis in mixed protocol environments. This results in insufficient analysis of protocols' abnormal behavior and potential vulnerabilities, which are crucial for detecting and defending against new attack patterns. To address these challenges, we propose an automatic PSM inference framework for unknown protocols, including a fuzzy membership-based auto-converging DBSCAN algorithm for protocol format clustering, followed by a session clustering algorithm based on Needleman-Wunsch and K-Medoids algorithms to classify sessions by protocol type. Finally, we refine a probabilistic PSM algorithm to infer protocol states and the transition conditions between these states. Experimental results show that, compared with existing PRE techniques, our method can infer PSMs while enabling more precise classification of protocols.
Cryptography and Security
What problem does this paper attempt to address?
This paper addresses a shortcoming of current Protocol Reverse Engineering (PRE) methods: they handle mixed-protocol environments poorly. Existing PRE methods mainly focus on identifying fields within a single protocol and neglect Protocol State Machine (PSM) analysis. As a result, abnormal protocol behavior and potential vulnerabilities are insufficiently analyzed, which weakens the ability to detect and defend against new attack patterns.

To address this, the authors propose an automated PSM inference framework that infers the state machines of unknown protocols from network traffic. The framework has three stages:

1. **Protocol Format Clustering**: Use the fuzzy membership-based auto-converging DBSCAN algorithm to extract feature vectors and cluster unknown protocol messages.
2. **Session Clustering**: Based on the Needleman-Wunsch and K-Medoids algorithms, classify sessions by protocol type.
3. **Protocol State Machine Inference**: Infer protocol states and their transition conditions through an improved probabilistic PSM algorithm.

Experimental results show that, compared with existing PRE techniques, this method can classify protocols more accurately and infer their state machines.

### Formula Summary

- **Minimum Support Calculation Formula**: \[ ms=\frac{|\{mi\in M \mid mf\subseteq mi\}|}{|M|} \] where \(mf\) is a frequent item set and \(M\) is a message set, so \(ms\) is the fraction of messages in \(M\) that contain \(mf\).
- **Fuzzy Membership Function Calculation Formula**: \[ \mu_{ij}=\frac{\text{LCSS}(mf_j, mi)}{\text{length}(mf_j)} \] where \(\text{LCSS}(mf_j, mi)\) represents the length of the longest common substring of the frequent item set \(mf_j\) and the message \(mi\).
- **Feature Vector Distance Calculation Formula**: \[ d(v_i, v_j)=\sqrt{\sum_{k = 1}^{q}(\mu_{ik}-\mu_{jk})^2} \]
- **State Transition Probability Calculation Formula**: \[ P_s(c_i\rightarrow c_j)=\frac{N_{c_i\rightarrow c_j}}{N_{c_i\rightarrow\text{all}}} \] \[ P_t(c_i\rightarrow c_j)=\frac{N_{c_i\rightarrow c_j}}{N_{\text{set}}} \] where \(N_{c_i\rightarrow c_j}\) is the number of transitions from state \(c_i\) to state \(c_j\), \(N_{c_i\rightarrow\text{all}}\) is the total number of transitions from state \(c_i\) to other states, and \(N_{\text{set}}\) is the total number of transitions.

Together, these methods and formulas aim to improve the analysis of unknown protocols in complex network environments and so help address network security challenges.
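The formulas summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation. All function names are hypothetical, and it rests on several assumptions: messages and frequent item sets are byte strings, \(mf\subseteq mi\) means that \(mf\) occurs as a contiguous run of bytes in \(mi\), LCSS is the longest common contiguous substring (following the wording above), and \(N_{c_i\rightarrow\text{all}}\) counts every outgoing transition of \(c_i\), including self-loops.

```python
from collections import Counter
from math import sqrt


def lcs_substring_len(a, b):
    """Length of the longest common contiguous substring of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the common run that ended at the previous positions.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def support(mf, messages):
    """Fraction of messages that contain mf (assumed: as a contiguous run)."""
    return sum(mf in m for m in messages) / len(messages)


def membership_vector(message, frequent_sets):
    """mu_ij = LCSS(mf_j, m_i) / length(mf_j) for each frequent item set."""
    return [lcs_substring_len(mf, message) / len(mf) for mf in frequent_sets]


def distance(v_i, v_j):
    """Euclidean distance between two membership vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(v_i, v_j)))


def transition_probabilities(sessions):
    """Return (P_s, P_t) as dicts keyed by (c_i, c_j).

    Each session is a sequence of state labels. P_s normalizes by the
    outgoing transitions of c_i, P_t by the total number of transitions.
    """
    pair_counts = Counter()
    out_counts = Counter()
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            pair_counts[(src, dst)] += 1
            out_counts[src] += 1
    total = sum(pair_counts.values())
    p_s = {k: n / out_counts[k[0]] for k, n in pair_counts.items()}
    p_t = {k: n / total for k, n in pair_counts.items()}
    return p_s, p_t
```

For instance, `transition_probabilities([["A", "B", "C"], ["A", "B", "B"], ["A", "C"]])` sees five transitions in total, three of them leaving `A`. Because two of those go to `B`, \(P_s(A\rightarrow B)=2/3\) and \(P_t(A\rightarrow B)=2/5\). The membership vectors from `membership_vector` are the kind of feature vectors a density-based clusterer such as DBSCAN could consume with `distance` as its metric. The auto-converging parameter selection described in the paper is not reproduced here.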