PyRadar: Towards Automatically Retrieving and Validating Source Code Repository Information for PyPI Packages

Kai Gao, Weiwei Xu, Wenhao Yang, Minghui Zhou
2024-04-25
Abstract: A package's source code repository records the development history of the package, providing indispensable information for the use and risk monitoring of the package. However, a package release often misses its source code repository due to the separation of the package's development platform from its distribution platform. Existing tools retrieve the release's repository information from its metadata, which suffers from two limitations: the metadata may lack repository information or contain wrong information. Our analysis shows that existing tools can only retrieve repository information for up to 70.5% of PyPI releases. To address the limitations, this paper proposes PyRadar, a novel framework that utilizes the metadata and source distribution to retrieve and validate the repository information for PyPI releases. We start with an empirical study to compare four existing tools on 4,227,425 PyPI releases and analyze phantom files (files appearing in the release's distribution but not in the release's repository) in 14,375 correct package-repository links and 2,064 incorrect links. Based on the findings, we design PyRadar with three components, i.e., Metadata-based Retriever, Source Code Repository Validator, and Source Code-based Retriever. In particular, the Metadata-based Retriever combines best practices of existing tools and successfully retrieves repository information from the metadata for 72.1% of PyPI releases. The Source Code Repository Validator applies common machine learning algorithms on six crafted features and achieves an AUC of up to 0.995. The Source Code-based Retriever queries World of Code with the SHA-1 hashes of all Python files in the release's source distribution and retrieves repository information for 90.2% of packages in our dataset with an accuracy of 0.970. Both practitioners and researchers can employ PyRadar to better use PyPI packages.
Software Engineering
### What problem does this paper attempt to solve?

This paper addresses the difficulty of accurately obtaining and verifying the source code repository information of Python packages (PyPI packages). The problem has three main aspects:

1. **Separation of development platform and distribution platform**:
   - The source code repositories of Python packages are usually hosted separately from their distribution platform (such as PyPI). Although this separation brings benefits (for example, smaller installation packages and a simpler build process), it also disconnects distribution packages from their source code repositories.
2. **Missing or incorrect information in metadata**:
   - Many existing tools rely on package metadata to retrieve source code repository information. However, the metadata may lack this information or contain incorrect information. According to the analysis in the paper, existing tools can retrieve repository information from the metadata for at most 70.5% of PyPI releases.
   - Developers may not declare the source code repository in the package's metadata, or may deliberately or inadvertently declare an incorrect repository.
3. **Limitations of existing tools**:
   - Existing tools rely mainly on metadata to retrieve source code repository information and cannot verify whether the retrieved information is correct. In addition, when the metadata contains no repository information, these tools cannot work.

To solve these problems, the paper proposes a new framework named **PyRadar**. PyRadar uses the metadata and source distribution of a package to automatically retrieve and validate the source code repository information of PyPI releases.
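To make the metadata-based approach concrete, the following is a minimal Python sketch of how a tool might look for a repository URL in a release's metadata. It is not PyRadar's retriever or any existing tool's implementation. It assumes the metadata is a dictionary shaped like the `info` object of PyPI's JSON API (with `home_page` and `project_urls` fields), and the function name `find_repo_url` and the list of recognized hosts are hypothetical choices made for illustration.

```python
import re
from typing import Optional

# Assumption: only these three hosting platforms are recognized.
REPO_URL = re.compile(
    r"https?://(?:www\.)?(github\.com|gitlab\.com|bitbucket\.org)"
    r"/([\w.-]+)/([\w.-]+)",
    re.IGNORECASE,
)


def find_repo_url(info: dict) -> Optional[str]:
    """Return a normalized repository URL found in the metadata, or None.

    `info` is assumed to look like the `info` object of PyPI's JSON API.
    """
    # Check declared project URLs first, then the home page.
    candidates = list((info.get("project_urls") or {}).values())
    candidates.append(info.get("home_page"))

    for text in candidates:
        if not text:
            continue
        match = REPO_URL.search(text)
        if match:
            host, owner, repo = match.groups()
            repo = repo.removesuffix(".git")  # Python 3.9+
            return f"https://{host.lower()}/{owner}/{repo}"
    return None


# Hypothetical metadata for illustration only.
example = {
    "home_page": "https://example.org",
    "project_urls": {"Source": "https://github.com/psf/requests/"},
}
print(find_repo_url(example))
```

A lookup like this illustrates both limitations listed above: it returns `None` when no URL is declared, and it has no way of telling whether a declared URL actually points to the repository the release was built from.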
PyRadar consists of three components:

- **Metadata-based Retriever**: Combines the best practices of existing tools and retrieves repository information from the metadata for 72.1% of PyPI releases.
- **Source Code Repository Validator**: Applies common machine learning algorithms to six crafted features and achieves an AUC of up to 0.995.
- **Source Code-based Retriever**: Queries World of Code with the SHA-1 hashes of all Python files in the release's source distribution, retrieving repository information for 90.2% of the packages in the dataset with an accuracy of 0.970.

With these components, PyRadar can retrieve and validate the source code repository information of PyPI packages more comprehensively and accurately, helping developers and researchers make better use of these packages and evaluate their risks.

### Summary

The main contributions of this paper are:

1. Conducting the first large-scale empirical study that compares existing metadata-based tools and investigates how phantom files differ between correct and incorrect package-repository links.
2. Proposing a heuristic method to automatically and accurately collect correct and incorrect package-repository links.
3. Designing and evaluating the PyRadar framework, which uses the metadata and source distribution to automatically retrieve and validate the source code repository information of PyPI releases.
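As a rough illustration of the first step of source code-based retrieval, the Python sketch below computes a SHA-1 hash for every Python file in a source distribution archive. It makes a loud assumption that the text does not confirm: files are hashed the way Git hashes blobs (SHA-1 over a `blob <size>\0` header followed by the content), so that the hashes could be matched against Git-based indexes. The helper names `git_blob_sha1`, `python_file_hashes`, and `build_demo_sdist` are hypothetical, and the actual World of Code query is omitted.

```python
import hashlib
import io
import tarfile


def git_blob_sha1(content: bytes) -> str:
    """SHA-1 of the content as a Git blob (assumed hashing scheme)."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def python_file_hashes(sdist: tarfile.TarFile) -> dict[str, str]:
    """Map each .py file path in the archive to its Git-style SHA-1."""
    hashes = {}
    for member in sdist.getmembers():
        if member.isfile() and member.name.endswith(".py"):
            extracted = sdist.extractfile(member)
            if extracted is not None:
                hashes[member.name] = git_blob_sha1(extracted.read())
    return hashes


def build_demo_sdist() -> bytes:
    """Build a tiny in-memory .tar.gz standing in for a real sdist."""
    files = [
        ("demo-1.0/demo/__init__.py", b""),
        ("demo-1.0/README.md", b"# demo\n"),
    ]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files:
            entry = tarfile.TarInfo(name)
            entry.size = len(data)
            tar.addfile(entry, io.BytesIO(data))
    return buf.getvalue()


with tarfile.open(fileobj=io.BytesIO(build_demo_sdist()), mode="r:gz") as tar:
    demo_hashes = python_file_hashes(tar)

for path, digest in demo_hashes.items():
    print(path, digest)
```

In a full pipeline, each hash would then be looked up in an index that maps file hashes to repositories, and the repositories containing the most files of the release would become candidates for validation.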