Llasm: Naming Functions in Binaries by Fusing Encoder-only and Decoder-only LLMs

Zihan Sha, Hao Wang, Zeyu Gao, Hui Shu, Bolun Zhang, Ziqing Wang, Chao Zhang
DOI: https://doi.org/10.1145/3702988
IF: 3.685
2024-01-01
ACM Transactions on Software Engineering and Methodology
Abstract: Predicting function names in stripped binaries, which requires succinctly summarizing the semantics of binary code in natural language, is a crucial but challenging task. Recently, many machine-learning-based solutions have been proposed. However, they generalize poorly, i.e., they fail to handle unseen binaries. To advance the state of the art, we present llasm (Large Assembly Language Model), a novel framework that fuses encoder-only and decoder-only LLMs for function name prediction. It refines encoder-only models to preserve more binary information and learn better binary representations. It then adopts a novel architecture to project the encoding into the input space of a decoder-only natural language model, which gives it a stronger ability to infer general knowledge and better generalizability. We have evaluated llasm on the BinaryCorp and Debin datasets. llasm outperforms the state-of-the-art function name prediction tools by up to 19.9%, 40.7%, and 36.5% in precision, recall, and F1 score, respectively, with significantly better generalizability on unseen binaries. Our case studies further demonstrate practical use cases of llasm in analyzing real-world malware, showing the usefulness of function name prediction.
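The abstract reports precision, recall, and F1 but does not say how they are computed. A common convention in function name prediction is to compare predicted and ground-truth names token by token. The sketch below assumes that convention; the helper names `split_name` and `token_prf` and the tokenization rule (underscores plus camelCase boundaries) are illustrative assumptions, not the paper's evaluation code.

```python
import re
from collections import Counter


def split_name(name):
    """Split a function name into lowercase tokens.

    Assumed rule: break on underscores and camelCase boundaries.
    """
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def token_prf(predicted, reference):
    """Return token-level (precision, recall, F1) for one name pair."""
    pred = Counter(split_name(predicted))
    ref = Counter(split_name(reference))
    # Multiset intersection counts each shared token at most as often
    # as it appears in both names.
    overlap = sum((pred & ref).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical prediction vs. ground truth, for illustration only.
    print(token_prf("read_config_file", "readFile"))
```

Under this assumed scheme, a prediction that contains every ground-truth token plus one extra token has perfect recall but reduced precision, which is one reason the three metrics can improve by different margins.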