OSPtrack: A Labeled Dataset Targeting Simulated Execution of Open-Source Software

Zhuoran Tan, Christos Anagnostopoulos, Jeremy Singer
2024-11-28
Abstract: Open-source software serves as a foundation for the internet and the cyber supply chain, but its exploitation is becoming increasingly prevalent. While advances in vulnerability detection for OSS have been significant, prior research has largely focused on static code analysis, often neglecting runtime indicators. To address this shortfall, we created a comprehensive dataset spanning five ecosystems, capturing features generated during the execution of packages and libraries in isolated environments. The dataset includes 9,461 package reports, of which 1,962 are identified as malicious, and encompasses both static and dynamic features such as files, sockets, commands, and DNS records. Each report is labeled with verified information and detailed sub-labels for attack types, facilitating the identification of malicious indicators when source code is unavailable. This dataset supports runtime detection, enhances detection model training, and enables efficient comparative analysis across ecosystems, contributing to the strengthening of supply chain security.
Cryptography and Security
What problem does this paper attempt to address?
The paper addresses shortcomings in existing open-source software (OSS) vulnerability detection, in particular the inability of static code analysis to capture runtime behavior. Specifically:

1. **Limitations of Static Code Analysis**: Most existing research focuses on static code analysis and neglects runtime indicators, which leaves OSS vulnerabilities embedded in complex systems insufficiently detected.
2. **Lack of a Comprehensive Dataset for OSS Runtime Behavior**: Some publicly available datasets collect malicious packages and libraries, but they do not cover runtime behavior, especially dynamic features such as files, sockets, commands, and DNS records.

To address these problems, the authors created a comprehensive dataset named OSPtrack, which has the following characteristics:

- **Multi-Ecosystem Coverage**: The dataset covers five open-source software ecosystems (npm, PyPI, crates.io, NuGet, Packagist).
- **Static and Dynamic Features**: Each report includes both static features and the dynamic features generated when the package is executed in an isolated environment, such as file operations, network activity, and command executions.
- **Detailed Label Information**: Each report is labeled with verified information and detailed sub-labels for attack types (such as data leakage and malicious command execution), so that malicious behavior can be identified even when source code is unavailable.

With this dataset, the authors aim to support runtime detection, improve the training of detection models, and enable efficient comparative analysis across ecosystems, thereby strengthening supply chain security.
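To make the idea of a labeled report with dynamic features more concrete, the following is a minimal Python sketch. The summary does not give the dataset's file format or field names, so the report structure, the keys (`ecosystem`, `label`, `sub_label`, `files`, `sockets`, `commands`, `dns`), and the sample values are all hypothetical and chosen only to mirror the feature types listed above. It counts the entries in each feature group and tallies malicious reports per ecosystem.

```python
from collections import Counter

# Hypothetical feature groups, named after the dynamic features the summary lists.
FEATURE_GROUPS = ("files", "sockets", "commands", "dns")


def count_features(report):
    """Return the number of recorded entries per feature group (0 if absent)."""
    return {group: len(report.get(group, [])) for group in FEATURE_GROUPS}


def malicious_by_ecosystem(reports):
    """Count the reports labeled 'malicious' in each ecosystem."""
    return Counter(
        r["ecosystem"] for r in reports if r.get("label") == "malicious"
    )


# Invented sample reports for illustration only; not taken from OSPtrack.
example_reports = [
    {
        "ecosystem": "npm",
        "label": "malicious",
        "sub_label": "malicious command execution",
        "files": ["/tmp/payload.sh"],
        "sockets": ["203.0.113.5:443"],
        "commands": ["chmod +x /tmp/payload.sh", "/tmp/payload.sh"],
        "dns": ["example.invalid"],
    },
    {
        "ecosystem": "pypi",
        "label": "benign",
        "sub_label": None,
        "files": ["/usr/lib/python3/site-packages/pkg/__init__.py"],
        "sockets": [],
        "commands": [],
        "dns": [],
    },
]

for report in example_reports:
    print(report["ecosystem"], report["label"], count_features(report))
print(malicious_by_ecosystem(example_reports))
```

Per-group counts like these are one simple way to turn a runtime report into a numeric feature vector for a detection model; the actual feature encoding used in the paper may differ.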
### Formula Representation

The paper does not rely on complex mathematical formulas. The key figures describing the dataset are:

- The dataset includes 9,461 package reports, of which 1,962 are confirmed as malicious.
- The dataset covers 8 feature dimensions, including file, socket, command, and DNS-related behaviors.
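The two report counts above imply a class imbalance that matters when training a detection model. The short Python sketch below works through the arithmetic. The only inputs taken from the source are 9,461 and 1,962. The inverse-frequency weighting is a common heuristic shown here as an assumption, not a method described in the paper, and the variable and function names are illustrative.

```python
TOTAL_REPORTS = 9461      # from the dataset description
MALICIOUS_REPORTS = 1962  # from the dataset description

# Reports that are not labeled malicious.
other_reports = TOTAL_REPORTS - MALICIOUS_REPORTS
malicious_share = MALICIOUS_REPORTS / TOTAL_REPORTS

print(f"reports not labeled malicious: {other_reports}")  # 7499
print(f"malicious share: {malicious_share:.1%}")          # 20.7%


def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * class_count).

    A common heuristic for imbalanced training data; assumed here,
    not taken from the paper.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * c) for name, c in counts.items()}


weights = inverse_frequency_weights(
    {"malicious": MALICIOUS_REPORTS, "other": other_reports}
)
print(weights)
```

Roughly one report in five is malicious, so a weighting of this kind gives the minority class a larger weight than the majority class.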