Code Ownership in Open-Source AI Software Security

Jiawen Wen, Dong Yuan, Lei Ma, Huaming Chen
2023-12-18
Abstract: As open-source AI software projects become an integral component of AI software development, it is critical to develop novel methods that ensure and measure the security of these projects for developers. Code ownership, pivotal in the evolution of such projects, offers insights into developer engagement and potential vulnerabilities. In this paper, we leverage code ownership metrics to empirically investigate their correlation with latent vulnerabilities across five prominent open-source AI software projects. The findings from the large-scale empirical study suggest a positive relationship between high-level ownership (characterised by a limited number of minor contributors) and a decrease in vulnerabilities. Furthermore, we introduce time metrics, anchored on the project's duration, individual source code file timelines, and the count of impacted releases. These metrics categorise distinct phases of open-source AI software projects and their respective vulnerability intensities. With these novel code ownership metrics, we have implemented a Python-based command-line application to aid project curators and quality assurance professionals in evaluating and benchmarking their on-site projects. We anticipate that this work will open a continuing line of research on securing and measuring open-source AI project security.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks how code ownership affects and reflects the security and vulnerabilities of open-source AI software projects. Specifically, it explores the relationship between code ownership and software vulnerabilities, and how that relationship changes across the development stages of a project. The authors aim to give project managers, maintenance teams, and quality assurance engineers the tools they need to improve project governance, identify security issues, and enhance user protection.

### Main contributions of the paper include:

1. **New code ownership metric**: A new code ownership metric is proposed specifically for the security of open-source AI applications. It combines the frequency/proportion of software components with time/release attributes, providing deeper insight into the connection between code ownership and open-source AI software vulnerabilities.
2. **Quantitative analysis**: A rigorous quantitative analysis was carried out on five open-source AI projects with known vulnerabilities, studying the interaction between the code ownership metric, time factors, and software iterations. The results show that the proposed code ownership metric is effective and reliable.
3. **Comparative analysis**: Comparing the new metric with the original code ownership metric and the classical metric further verified its effectiveness. Analyzing potential bias variables deepened the understanding of the new metric in the field of open-source AI software.
4. **Tool development**: A Python-based command-line tool was developed to help developers and quality assurance experts calculate code ownership for an entire repository and for specific files.

### Research questions and hypotheses

#### Research questions

- **RQ1**: In open-source AI projects, how do code ownership metrics develop?
  How are these metrics related to the characteristics of software components and vulnerabilities?
- **RQ2**: Compared with Bird et al.'s original code ownership metric and the classical process metric, how does the proposed metric perform in terms of effectiveness, accuracy, and robustness? What is the impact of modifying variables and thresholds related to code ownership on vulnerability severity classification?
- **RQ3**: Does the proposed code ownership metric change as the development stage changes? What uses can be drawn from practical applications?

#### Hypotheses

- **Hypothesis 1**: The number of vulnerabilities in a software component increases with the number of minor contributors, and this is affected by the duration of the project development stage and by operational practices.
- **Hypothesis 2**: The vulnerability susceptibility of a software component has no relation to its vulnerability occurrence rate.
- **Hypothesis 3**: The behavior of a software component has no relation to its position within the project scope.

### Results

#### Examination of potential distortion factors

- **Impact of vulnerability file/commit frequency**: By comparing the calculation results of the metric with the non-vulnerability data set, it was found that the change in vulnerability occurrence rate does not significantly affect the correlation between matrices.
- **Impact of minor contributor threshold definition**: Correlation heat maps of the metric results were generated for different thresholds (5%, 10%, 20%, 50%), and the 10% threshold was found to be optimal.
- **Impact of local clustering**: By synthesizing two correlation heat map matrices (corresponding to file components and group components respectively), it was found that the locality of software components has no significant impact on the metric results.
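
To make the threshold comparison above concrete, here is a minimal sketch of how per-file ownership figures might be computed at each minor-contributor threshold. It assumes that contribution is measured by commit count and that a minor contributor is one whose share falls below the threshold, in the general style of Bird et al.'s ownership measures. The function name `ownership_metrics`, the dictionary keys, and the toy commit history are hypothetical; this is not the paper's tool or its exact metric.

```python
from collections import Counter


def ownership_metrics(commit_authors, minor_threshold=0.10):
    """Summarise ownership of one file from the authors of its commits.

    Assumption: a contributor is "minor" when their share of commits
    is strictly below minor_threshold, and "major" otherwise.
    """
    if not commit_authors:
        raise ValueError("need at least one commit")
    counts = Counter(commit_authors)
    total = len(commit_authors)
    shares = {author: n / total for author, n in counts.items()}
    minor = [a for a, share in shares.items() if share < minor_threshold]
    return {
        "ownership": max(shares.values()),  # share of the top contributor
        "total_contributors": len(shares),
        "minor_contributors": len(minor),
        "major_contributors": len(shares) - len(minor),
    }


# Hypothetical commit history for a single file (30 commits, 4 authors).
history = ["alice"] * 17 + ["bob"] * 8 + ["carol"] * 4 + ["dave"] * 1

# Recompute the figures at each threshold mentioned in the text.
for threshold in (0.05, 0.10, 0.20, 0.50):
    print(threshold, ownership_metrics(history, threshold))
```

Raising the threshold reclassifies more contributors as minor, which is why the choice of threshold can change how strongly the minor-contributor count correlates with vulnerabilities.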
#### Correlation analysis

- **Direct association**: By analyzing the vulnerability and non-vulnerability sources in the balanced data set, it was found that the time metrics (especially Days difference and Age) have a strong negative correlation with vulnerabilities. This suggests that the number of vulnerabilities may decrease over time.

Through these studies, the paper provides a theoretical and practical basis for understanding and improving the security of open-source AI software projects.
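
The direct association described above can be illustrated with a small correlation check. The sketch below assumes a Pearson coefficient between a time metric and a binary vulnerable/non-vulnerable label; the summary does not say which coefficient the paper uses, so that choice is an assumption. The `pearson` helper, the interpretation of Age as days since a file's first commit, and all values are hypothetical toy data, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Toy balanced sample: three vulnerable and three non-vulnerable files.
# age_days is assumed here to mean days since the file's first commit.
age_days = [30, 90, 200, 600, 1100, 1500]
is_vulnerable = [1, 1, 1, 0, 0, 0]

r = pearson(age_days, is_vulnerable)
print(f"correlation between age and vulnerability label: {r:.2f}")
```

A negative coefficient in this toy sample means that older files are less often labelled vulnerable, which is the direction of the relationship the summary reports for the time metrics.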