Defectors: A Large, Diverse Python Dataset for Defect Prediction

Parvez Mahbub, Ohiduzzaman Shuvo, Mohammad Masudur Rahman

DOI: https://doi.org/10.1109/MSR59073.2023.00085

2023-07-25

Abstract: Defect prediction has been a popular research topic where machine learning (ML) and deep learning (DL) have found numerous applications. However, these ML/DL-based defect prediction models are often limited by the quality and size of their datasets. In this paper, we present Defectors, a large dataset for just-in-time and line-level defect prediction. Defectors consists of $\approx$ 213K source code files ($\approx$ 93K defective and $\approx$ 120K defect-free) that span across 24 popular Python projects. These projects come from 18 different domains, including machine learning, automation, and internet-of-things. Such a scale and diversity make Defectors a suitable dataset for training ML/DL models, especially transformer models that require large and diverse datasets. We also foresee several application areas of our dataset including defect prediction and defect explanation. Dataset link: https://doi.org/10.5281/zenodo.7708984

Software Engineering

What problem does this paper attempt to address?

The paper addresses three limitations of existing datasets for software defect prediction:

1. **Insufficient dataset size**: Existing datasets are often too small, which limits how far machine learning (ML) and deep learning (DL) models can improve. The paper points out that model performance generally improves as dataset size increases.
2. **Class imbalance**: The proportion of defective instances in existing datasets is low (usually between 5% and 26%), which can lead to poor performance on the defective class.
3. **Lack of diversity**: Many existing datasets draw on only a few projects or a single organization, which limits how well models generalize across domains and organizations.

To address these issues, the authors propose a large-scale dataset named **Defectors**, with the following characteristics:

- **Size**: Defectors is presented as the largest defect prediction dataset to date, containing approximately 213,000 source code files (about 93,000 defective and about 120,000 defect-free).
- **Class balance**: The ratio of defective to defect-free instances in the training set is close to 1:1, which avoids the class imbalance problem.
- **Diversity**: The dataset covers 24 popular Python projects from 18 different domains, which should help models generalize.
- **Platform diversity**: The dataset is built from Python projects, whereas most existing defect prediction datasets are built from Java projects.

With these properties, Defectors aims to provide high-quality training data for large deep learning models, and in turn to improve the accuracy and generalization of defect prediction.
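As a rough check on the class-balance point, the short sketch below computes the share of defective files from the rounded counts quoted in the abstract and compares it with the 5%-26% range cited for existing datasets. The constant and function names are hypothetical, the counts are approximations, and the result describes the dataset as a whole rather than the training split, whose exact composition is not given here.

```python
# Approximate file counts quoted in the abstract (rounded to the nearest thousand).
# These are assumptions for illustration, not values read from the dataset itself.
DEFECTIVE_FILES = 93_000
DEFECT_FREE_FILES = 120_000

# Upper end of the defective share reported for existing datasets (5%-26%).
EXISTING_MAX_SHARE = 0.26


def defect_share(defective: int, defect_free: int) -> float:
    """Return the fraction of instances that are labeled defective."""
    total = defective + defect_free
    if total == 0:
        raise ValueError("dataset is empty")
    return defective / total


share = defect_share(DEFECTIVE_FILES, DEFECT_FREE_FILES)

print(f"defective share overall: {share:.1%}")
print(f"above the 26% upper bound of existing datasets: {share > EXISTING_MAX_SHARE}")
```

On these rounded figures the overall defective share comes out a little under one half, well above the range reported for earlier datasets; the near 1:1 ratio mentioned above refers specifically to the training set.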

Unifying Defect Prediction, Categorization, and Repair by Multi-Task Deep Learning

Deep Learning for Just-In-Time Defect Prediction

Gdefects4dl: A Dataset of General Real-World Deep Learning Program Defects

Continuous Defect Prediction: The Idea and a Related Dataset

Transductive Instance Transfer Learning for Cross-Language Defect Prediction

Predicting Line-Level Defects by Capturing Code Contexts with Hierarchical Transformers

DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

ConDefects: A Complementary Dataset to Address the Data Leakage Concern for LLM-Based Fault Localization and Program Repair

Combined Classifier for Cross-Project Defect Prediction: An Extended Empirical Study

A New Improved Prediction of Software Defects Using Machine Learning-based Boosting Techniques with NASA Dataset

Software Defect Prediction for Healthcare Big Data: An Empirical Evaluation of Machine Learning Techniques

Software Defect Prediction via Transformer

Is Deep Learning Good Enough for Software Defect Prediction?

A Survey on Software Defect Prediction Using Deep Learning

A Survey of Software Defect Prediction Based on Deep Learning

Software visualization and deep transfer learning for effective software defect prediction

Understanding machine learning software defect predictions