Abstract: Package managers such as NPM, Maven, and PyPI play a pivotal role in open-source software (OSS) ecosystems, streamlining the distribution and management of various freely available packages. The fine-grained details within software packages can unveil potential risks within existing OSS ecosystems, offering valuable insights for detecting malicious packages. In this study, we undertake a large-scale empirical analysis focusing on fine-grained information (FGI): the metadata, static, and dynamic functions. Specifically, we investigate the FGI usage across a diverse set of 50,000+ legitimate and 1,000+ malicious packages. Based on this diverse data collection, we conducted a comparative analysis between legitimate and malicious packages. Our findings reveal that (1) malicious packages have less metadata content and utilize fewer static and dynamic functions than legitimate ones; (2) malicious packages demonstrate a higher tendency to invoke HTTP/URL functions as opposed to other application services, such as FTP or SMTP; (3) FGI serves as a distinguishable indicator between legitimate and malicious packages; and (4) one dimension in FGI has sufficient distinguishable capability to detect malicious packages, and combining all dimensions in FGI cannot significantly improve overall performance.

What problem does this paper attempt to address?

The main problem this paper addresses is the security threat posed by malicious packages in open-source software (OSS) ecosystems. Specifically, the authors use fine-grained information (FGI) to distinguish legitimate packages from malicious ones, with the goal of supporting effective detection of malicious packages. The objectives and methods of the research are as follows.

### Research Background and Problem

With the wide adoption of open-source software, package managers such as NPM, Maven, and PyPI play a crucial role in distributing and managing freely available packages. However, reusing these packages also brings security risks, because attackers may inject malicious code or tamper with originally legitimate packages in order to attack the systems that install them. For example, in 2018, attackers exploited the development permissions of the 'eslint-scope' package to embed malicious executables, affecting a large number of systems.

### Research Objectives

The paper explores how FGI can be applied to detecting malicious packages, through a large-scale empirical analysis of more than 50,000 legitimate packages and more than 1,000 malicious packages. The research focuses on three dimensions:

- **Metadata**: the package name, version, author, dependencies, and similar fields.
- **Static functions**: methods integrated directly into the source code, which depend on the programming language used.
- **Dynamic functions**: functions that provide flexibility during the installation or running phase.

### Research Questions

To achieve these objectives, the research poses the following research questions (RQs):

1. **RQ1**: How do legitimate and malicious packages differ at the metadata level?
2. **RQ2**: How do legitimate and malicious packages differ at the static function level?
3. **RQ3**: How do legitimate and malicious packages differ at the dynamic function level?
4. **RQ4**: How effective is fine-grained information for detecting malicious packages?

### Main Findings

Through comparative analysis, the authors draw the following conclusions:

1. **Metadata level**: Malicious packages usually have shorter descriptions, fewer authors, missing URL links, and fewer dependencies.
2. **Static function level**: The types and frequencies of static functions called by malicious packages differ significantly from those of legitimate packages, especially for network-related and file-operation functions.
3. **Dynamic function level**: Malicious packages are more likely to call HTTP/URL functions than other application services (such as FTP or SMTP).
4. **Detection effectiveness**: The detection model based on fine-grained information can achieve an accuracy of 97.5% and a recall of 94.4%, indicating that FGI can serve as a reliable distinguishing indicator.

### Summary

Through in-depth analysis of fine-grained information, this research reveals significant differences between legitimate and malicious packages and proposes an effective method for detecting malicious packages. The results help improve the security of the open-source software ecosystem and provide an important reference for developers.
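The metadata-level signals listed in the findings (description length, number of authors, presence of URL links, number of dependencies) can be illustrated with a minimal sketch. It assumes an npm-style metadata dictionary; the function name `metadata_features`, the choice of fields, and the sample packages are hypothetical and are not taken from the paper, whose actual feature set and extraction pipeline may differ.

```python
from typing import Any, Dict


def metadata_features(pkg: Dict[str, Any]) -> Dict[str, int]:
    """Compute simple metadata counts from an npm-style metadata dict.

    Illustrative only: the field names follow common package.json
    conventions and are an assumption, not the paper's feature list.
    """
    description = pkg.get("description") or ""

    # Count the primary author plus any listed contributors.
    authors = []
    if pkg.get("author"):
        authors.append(pkg["author"])
    authors.extend(pkg.get("contributors") or [])

    # Fields that usually carry a URL in npm-style metadata.
    url_fields = ("homepage", "repository", "bugs")
    url_count = sum(1 for field in url_fields if pkg.get(field))

    dependencies = pkg.get("dependencies") or {}

    return {
        "description_length": len(description),
        "author_count": len(authors),
        "url_count": url_count,
        "dependency_count": len(dependencies),
    }


# Hypothetical sample packages, invented for this example.
sparse = {"name": "example-sparse", "version": "1.0.0"}
rich = {
    "name": "example-rich",
    "version": "2.3.1",
    "description": "A small utility for parsing dates.",
    "author": "Jane Doe",
    "contributors": ["John Roe"],
    "homepage": "https://example.com",
    "repository": {"type": "git", "url": "https://example.com/repo.git"},
    "dependencies": {"dep-a": "^1.0.0", "dep-b": "^2.0.0"},
}

for package in (sparse, rich):
    print(package["name"], metadata_features(package))
```

Counts like these could be fed to any standard classifier as one FGI dimension. A sparse record is only a weak signal by itself, since many legitimate packages also ship minimal metadata.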

A Large-scale Fine-grained Analysis of Packages in Open-Source Software Ecosystems
