Exploring Security Commits in Python

Shiyu Sun, Shu Wang, Xinda Wang, Yunlong Xing, Elisa Zhang, Kun Sun
2023-07-22
Abstract: Python has become the most popular programming language as it is friendly to work with for beginners. However, a recent study has found that most security issues in Python have not been indexed by CVE and may only be fixed by 'silent' security commits, which pose a threat to software security and hinder the security fixes to downstream software. It is critical to identify the hidden security commits; however, the existing datasets and methods are insufficient for security commit detection in Python, due to the limited data variety, non-comprehensive code semantics, and uninterpretable learned features. In this paper, we construct the first security commit dataset in Python, namely PySecDB, which consists of three subsets including a base dataset, a pilot dataset, and an augmented dataset. The base dataset contains the security commits associated with CVE records provided by MITRE. To increase the variety of security commits, we build the pilot dataset from GitHub by filtering keywords within the commit messages. Since not all commits provide commit messages, we further construct the augmented dataset by understanding the semantics of code changes. To build the augmented dataset, we propose a new graph representation named CommitCPG and a multi-attributed graph learning model named SCOPY to identify the security commit candidates through both sequential and structural code semantics. The evaluation shows our proposed algorithms can improve the data collection efficiency by up to 40 percentage points. After manual verification by three security experts, PySecDB consists of 1,258 security commits and 2,791 non-security commits. Furthermore, we conduct an extensive case study on PySecDB and discover four common security fix patterns that cover over 85% of security commits in Python, providing insight into secure software maintenance, vulnerability detection, and automated program repair.
Cryptography and Security, Software Engineering
What problem does this paper attempt to address?
The paper focuses on security commits in Python projects. Specifically, it addresses the following issues:

1. **Identifying Hidden Security Commits**: Many Python security issues are not recorded by CVE (Common Vulnerabilities and Exposures) and may only be fixed through "silent" security commits. These commits often carry no explicit indication in their messages that they fix a security vulnerability, which threatens software security and delays fixes in downstream software. One goal of the paper is to identify these hidden security commits.
2. **Building a Python Security Commit Dataset**: Existing datasets and methods are insufficient for detecting security commits in Python because of limited data variety, non-comprehensive code semantics, and uninterpretable learned features. To address this, the paper constructs the first Python security commit dataset, PySecDB, which includes three subsets: the base dataset, the pilot dataset, and the augmented dataset.
3. **Developing Effective Detection Methods**: To enrich the dataset and improve collection efficiency, the paper proposes keyword-based filtering of commit messages to identify potential security commits, a new graph representation (CommitCPG), and a multi-attributed graph learning model (SCOPY) that identifies security commit candidates from the sequential and structural semantics of code changes.
4. **Discovering Security Fix Patterns**: Through an extensive case study on PySecDB, the paper identifies four common security fix patterns that cover over 85% of security commits in Python. These patterns provide insight into secure software maintenance, vulnerability detection, and automated program repair.

In summary, this paper focuses on constructing a comprehensive Python security commit dataset and developing effective methods to identify and understand these commits. The ultimate goal is to enhance the security and reliability of Python software.
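The keyword-based filtering used for the pilot dataset can be illustrated with a minimal sketch. The keyword list, function names, and sample commits below are hypothetical and chosen only for illustration; the paper's actual keywords and pipeline are not given in this summary.

```python
import re

# Hypothetical keyword stems for illustration only; not the paper's list.
SECURITY_KEYWORDS = ("security", "vulnerab", "cve", "xss", "injection", "overflow", "sanitiz")

# One case-insensitive pattern that matches any keyword stem.
_PATTERN = re.compile("|".join(re.escape(k) for k in SECURITY_KEYWORDS), re.IGNORECASE)


def is_security_candidate(message: str) -> bool:
    """Return True if the commit message contains any security keyword."""
    return bool(message) and _PATTERN.search(message) is not None


def filter_candidates(commits):
    """Keep the (sha, message) pairs whose message matches a keyword."""
    return [(sha, msg) for sha, msg in commits if is_security_candidate(msg)]


# Made-up commits for demonstration.
commits = [
    ("a1b2c3", "Fix XSS in template rendering"),
    ("d4e5f6", "Update README"),
    ("0a1b2c", ""),  # no commit message at all
    ("9f8e7d", "Sanitize user input before SQL query"),
]

for sha, msg in filter_candidates(commits):
    print(sha, msg)
```

The commit with an empty message is dropped regardless of what its code change does. This is the gap that motivates the augmented dataset, where candidates are identified from the semantics of the code changes rather than from the message.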