Fast Reused Function Retrieval Method Based On Simhash And Inverted Index

Yanchen Qiao, Xiaochun Yun, Yongzheng Zhang
DOI: https://doi.org/10.1109/TrustCom.2016.0159
2016-01-01
Abstract: Reusing code from open source projects or from one's own previous work is common in both software and malware development. In addition, compilers often insert many functions of their own during compilation. Quickly identifying these reused functions in binary executables and tracing their origins is therefore helpful for reverse engineering, software copyright protection, malware detection and correlation, and similar tasks. Much research in recent years has focused on code similarity identification, whereas only a few researchers have addressed the problem of reused function retrieval. In this paper, we propose a method for quickly retrieving reused functions from a large corpus of code, based on simhash and an inverted index. First, we construct a code database containing a large number of binary executables, their functions, the code blocks of those functions, the simhash values of the code blocks, and an inverted index over these elements. Then, for a function to be retrieved, we split it into several code blocks according to its jump instructions and jump addresses, and calculate a simhash value for every code block. Similar code blocks can be retrieved quickly from the code database based on their simhash values. Consequently, we can easily retrieve the possibly similar functions using the inverted index, and further locate them in binary executables. The experimental evaluation shows that our method achieves high accuracy and recall, and runs fast.
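The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything concrete in it is an assumption made for illustration: the instruction format (address, mnemonic, operand), the JUMPS set, the use of mnemonics as simhash features, the 64-bit MD5-derived token hash, the Hamming threshold, and the names split_blocks, simhash, and CodeIndex. The lookup also scans every stored simhash linearly for brevity, which a large corpus would have to replace with a faster near-duplicate search.

```python
import hashlib
from collections import defaultdict

# Assumed subset of x86 jump mnemonics; a real disassembler knows many more.
JUMPS = {"jmp", "je", "jne", "jz", "jnz", "ja", "jb", "jg", "jl"}


def _target(operand):
    """Parse a hexadecimal jump target; return None for indirect jumps."""
    try:
        return int(operand, 16)
    except ValueError:
        return None


def split_blocks(instructions):
    """Split (address, mnemonic, operand) tuples into blocks of mnemonics.

    A block ends after a jump instruction and a new one starts at a jump target.
    """
    targets = set()
    for _, mnemonic, operand in instructions:
        if mnemonic in JUMPS and _target(operand) is not None:
            targets.add(_target(operand))
    blocks, current = [], []
    for addr, mnemonic, _ in instructions:
        if addr in targets and current:
            blocks.append(current)
            current = []
        current.append(mnemonic)
        if mnemonic in JUMPS:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def simhash(tokens, bits=64):
    """Simhash fingerprint: per-bit vote over the hashes of all tokens."""
    weights = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


class CodeIndex:
    """Inverted index from block simhash to the functions containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, name, instructions):
        for block in split_blocks(instructions):
            self.postings[simhash(block)].add(name)

    def query(self, instructions, max_distance=3):
        """Rank indexed functions by how many query blocks they match."""
        votes = defaultdict(int)
        for block in split_blocks(instructions):
            h = simhash(block)
            matched = set()
            for key, names in self.postings.items():  # linear scan (simplification)
                if hamming(h, key) <= max_distance:
                    matched |= names
            for name in matched:
                votes[name] += 1
        return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
```

In this sketch, a function is indexed with CodeIndex.add(name, instructions) and looked up with CodeIndex.query(instructions); candidates that share more near-identical blocks with the query rank higher. Counting matched blocks is just one simple way to score candidates and is not taken from the paper.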