HIMALIA: Recovering Compiler Optimization Levels from Binaries by Deep Learning

Yu Chen, Zhiqiang Shi, Hong Li, Weiwei Zhao, Yiliang Liu, Yuansong Qiao
DOI: https://doi.org/10.1007/978-3-030-01054-6_3
2018-11-09
Abstract: Compiler optimization levels are important for binary analysis, but they are not available in commercial off-the-shelf (COTS) binaries. In this paper, we present HIMALIA, the first end-to-end system that recovers compiler optimization levels from disassembled binary code without any knowledge of the target instruction set semantics. We achieve this by formulating the problem as a deep learning task and training a two-layer recurrent neural network. Besides the recurrent neural network, HIMALIA is also powered by two other techniques: instruction embedding and a new function representation method. We implement HIMALIA and carry out comprehensive experiments on our dataset consisting of 378,695 different functions from 5,828 binaries compiled by GCC. The results show that HIMALIA achieves an accuracy of around 89%. Moreover, we find that HIMALIA’s learnt model is explicable: it can auto-learn common compiler conventions and idioms that match our prior knowledge.
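To make the pipeline in the abstract concrete, the following is a minimal sketch of the general idea only: instruction tokens are mapped to embedding vectors, passed through two stacked recurrent layers, and the final hidden state is classified into an optimization level. It is not the authors' implementation. The class name `TinyOptLevelClassifier`, the label set `LEVELS`, the tokenization by mnemonic, the simple tanh recurrent cell, and all dimensions are assumptions made for illustration, and the weights are random (untrained), so the output probabilities carry no meaning.

```python
import math
import random

# Assumed label set (common GCC flags); the abstract does not list the classes.
LEVELS = ["O0", "O1", "O2", "O3", "Os"]


def rand_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these by training."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_layer(inputs, w_in, w_rec, hidden_size):
    """Simple recurrent layer: h_t = tanh(W_in x_t + W_rec h_{t-1})."""
    h = [0.0] * hidden_size
    outputs = []
    for x in inputs:
        a = matvec(w_in, x)
        b = matvec(w_rec, h)
        h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
        outputs.append(h)
    return outputs


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class TinyOptLevelClassifier:
    """Hypothetical toy model: embedding -> RNN -> RNN -> softmax."""

    def __init__(self, vocab, embed_dim=8, hidden=16, seed=0):
        rng = random.Random(seed)
        self.index = {tok: i for i, tok in enumerate(vocab)}
        self.hidden = hidden
        self.embed = rand_matrix(len(vocab), embed_dim, rng)  # one row per token
        self.w_in1 = rand_matrix(hidden, embed_dim, rng)      # layer 1 input weights
        self.w_rec1 = rand_matrix(hidden, hidden, rng)        # layer 1 recurrent weights
        self.w_in2 = rand_matrix(hidden, hidden, rng)         # layer 2 input weights
        self.w_rec2 = rand_matrix(hidden, hidden, rng)        # layer 2 recurrent weights
        self.w_out = rand_matrix(len(LEVELS), hidden, rng)    # output projection

    def predict_proba(self, instructions):
        """Return one probability per entry of LEVELS for a token sequence."""
        xs = [self.embed[self.index[tok]] for tok in instructions]
        h1 = rnn_layer(xs, self.w_in1, self.w_rec1, self.hidden)
        h2 = rnn_layer(h1, self.w_in2, self.w_rec2, self.hidden)
        return softmax(matvec(self.w_out, h2[-1]))


# Usage: a made-up function body, tokenized by mnemonic (an assumption).
vocab = ["push", "mov", "sub", "call", "leave", "ret"]
model = TinyOptLevelClassifier(vocab)
probs = model.predict_proba(["push", "mov", "sub", "call", "leave", "ret"])
print(dict(zip(LEVELS, probs)))
```

The sketch classifies from the last hidden state of the second layer, which is one common choice for sequence classification; the abstract does not say how HIMALIA's function representation aggregates instruction-level information, so that detail should not be read into the code.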