Killing Two Birds with One Stone: Malicious Package Detection in NPM and PyPI Using a Single Model of Malicious Behavior Sequence

Junan Zhang, Kaifeng Huang, Yiheng Huang, Bihuan Chen, Ruisi Wang, Chong Wang, Xin Peng
DOI: https://doi.org/10.1145/3705304
IF: 3.685
2024-01-01
ACM Transactions on Software Engineering and Methodology
Abstract: The open-source software (OSS) supply chain enlarges the attack surface of a software system, which makes package registries attractive targets for attacks. Recently, multiple package registries have faced intensified attacks through malicious packages. Of those registries, NPM and PyPI are two of the most severely affected. Existing malicious package detectors are developed with features from a list of packages of a single ecosystem and deployed exclusively within that ecosystem, which makes it infeasible to use the knowledge gained from a recently detected malicious NPM package to detect a new malicious package in PyPI. Moreover, existing detectors lack support for modeling the malicious behavior of OSS packages in a sequential way. To address these two limitations, we propose a single detection model using malicious behavior sequences, named Cerebro, to detect malicious packages in both NPM and PyPI. We curate a feature set based on a high-level abstraction of malicious behavior to enable multi-lingual knowledge fusion. We organize the extracted features into a behavior sequence to model sequential malicious behavior. We fine-tune a pre-trained language model to understand the semantics of malicious behavior. Extensive evaluation has demonstrated the effectiveness of Cerebro over the state of the art, as well as its practically acceptable efficiency. Cerebro has detected 683 and 799 new malicious packages in PyPI and NPM, respectively, and received 707 thank-you letters from the official PyPI and NPM teams.