Defectors: A Large, Diverse Python Dataset for Defect Prediction

Parvez Mahbub, Ohiduzzaman Shuvo, Mohammad Masudur Rahman
DOI: https://doi.org/10.1109/MSR59073.2023.00085
2023-07-25
Abstract: Defect prediction has been a popular research topic where machine learning (ML) and deep learning (DL) have found numerous applications. However, these ML/DL-based defect prediction models are often limited by the quality and size of their datasets. In this paper, we present Defectors, a large dataset for just-in-time and line-level defect prediction. Defectors consists of approximately 213K source code files (about 93K defective and about 120K defect-free) that span 24 popular Python projects. These projects come from 18 different domains, including machine learning, automation, and internet-of-things. Such scale and diversity make Defectors a suitable dataset for training ML/DL models, especially transformer models that require large and diverse datasets. We also foresee several application areas of our dataset, including defect prediction and defect explanation. Dataset link: https://doi.org/10.5281/zenodo.7708984
Software Engineering
What problem does this paper attempt to address?
The paper attempts to address three key issues in existing datasets for software defect prediction:

1. **Insufficient dataset size**: Existing datasets are often too small, which limits how far machine learning (ML) and deep learning (DL) models can improve. The paper points out that model performance generally improves as dataset size increases.
2. **Class imbalance in datasets**: The proportion of defective instances in existing datasets is relatively low (usually between 5% and 26%), which may lead to poor model performance on defective instances.
3. **Lack of dataset diversity**: Many existing datasets contain data from only a few projects or a single organization, which limits how well models generalize across domains and organizations.

To address these issues, the authors propose a large-scale dataset named **Defectors**. This dataset has the following characteristics:

- **Size**: The authors present Defectors as the largest defect prediction dataset to date, containing approximately 213,000 source code files (about 93,000 defective files and about 120,000 non-defective files).
- **Class balance**: The ratio of defective to non-defective instances in the training set is close to 1:1, which avoids the class imbalance problem.
- **Diversity**: The dataset covers 24 popular Python projects from 18 different domains, which supports the generalization ability of models.
- **Language diversity**: The dataset is built from Python projects, complementing existing datasets, most of which are drawn from Java projects.

With these improvements, the Defectors dataset aims to provide high-quality data for training large deep learning models, thereby improving the accuracy and generalization ability of defect prediction.
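The class-balance point can be checked roughly against the counts quoted in the abstract. The sketch below is only an illustration: the constants are the rounded figures from the abstract (93K defective, 120K defect-free), the function name `class_balance` is hypothetical, and the result describes the dataset as a whole rather than the training split, whose exact composition is not given here.

```python
# Rounded file counts taken from the abstract (assumed exact for illustration).
DEFECTIVE = 93_000
DEFECT_FREE = 120_000


def class_balance(defective: int, defect_free: int) -> dict:
    """Summarize class proportions for a two-class defect dataset."""
    total = defective + defect_free
    return {
        "total": total,
        # Fraction of all files that are labeled defective.
        "defective_share": defective / total,
        # Number of defect-free files per defective file.
        "defect_free_per_defective": defect_free / defective,
    }


stats = class_balance(DEFECTIVE, DEFECT_FREE)
print(f"total files: {stats['total']:,}")
print(f"defective share: {stats['defective_share']:.1%}")
print(f"defect-free per defective: {stats['defect_free_per_defective']:.2f}")
```

With these rounded inputs, the defective share comes out near 43.7%, noticeably higher than the 5% to 26% range cited for earlier datasets.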