Efficiently Identifying Binary Similarity Based on Deep Hashing and Contrastive Learning

Jiaqi Xiong, Shaoyin Cheng, Han Gao, Weiming Zhang
DOI: https://doi.org/10.1109/ICCCBDA56900.2023.10154664
2023-01-01
Abstract: Binary similarity detection aims to identify semantic similarities between two or more binary code snippets. In recent years, deep learning-based methods have shown promising results. They formalize code similarity as a nearest neighbor retrieval task, and the overall workflow can be divided into two stages: 1) feeding the code snippets into an embedding model to obtain the corresponding high-dimensional vectors as fingerprints (i.e., constructing the codebase); 2) using the codebase for nearest neighbor retrieval to get the top-k results. Most existing studies focus only on the first stage (more specifically, the embedding model) while ignoring the overhead of the retrieval stage. In real-world scenarios, the codebase can be quite large and contain a massive number of embeddings, which makes precise nearest neighbor retrieval prohibitively expensive. To mitigate this issue, this paper proposes a novel approach, dubbed BinCH, which can efficiently perform code search without sacrificing accuracy.
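The two-stage workflow described in the abstract can be illustrated with a minimal sketch. This is not BinCH itself: the abstract does not specify the model or the hashing scheme, so the sketch assumes the embeddings already exist and uses simple sign-based hashing as a stand-in for a learned hash function. The function names, the snippet names, and the toy vectors are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two real-valued embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_top_k(query, codebase, k):
    """Exact retrieval: score the query against every stored embedding."""
    ranked = sorted(codebase, key=lambda name: cosine(query, codebase[name]),
                    reverse=True)
    return ranked[:k]

def sign_hash(vec):
    """Assumed stand-in for a learned hash: one bit per dimension (its sign)."""
    code = 0
    for x in vec:
        code = (code << 1) | (1 if x > 0 else 0)
    return code

def hamming_top_k(query, hashed_codebase, k):
    """Hash-based retrieval: rank by Hamming distance between bit codes."""
    q = sign_hash(query)
    ranked = sorted(hashed_codebase,
                    key=lambda name: bin(q ^ hashed_codebase[name]).count("1"))
    return ranked[:k]

# Stage 1 (assumed done elsewhere): embeddings for hypothetical code snippets.
codebase = {
    "memcpy_x86": [0.9, 0.1, -0.3, 0.4],
    "memcpy_arm": [0.8, 0.2, -0.2, 0.5],
    "strlen_x86": [-0.7, 0.6, 0.1, -0.2],
}
hashed_codebase = {name: sign_hash(vec) for name, vec in codebase.items()}

# Stage 2: retrieve the top-k neighbors of a query embedding.
query = [0.85, 0.15, -0.25, 0.45]
print(exact_top_k(query, codebase, 2))
print(hamming_top_k(query, hashed_codebase, 2))
```

Comparing short bit codes with XOR and a bit count is much cheaper per candidate than a floating-point similarity over high-dimensional vectors, which is the general motivation for hashing-based retrieval on large codebases.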