A Dataset of Duplicate Pull-Requests in GitHub

Yue Yu, Zhixing Li, Gang Yin, Tao Wang, Huaimin Wang
DOI: https://doi.org/10.1145/3196398.3196455
2018-01-01
Abstract: In GitHub, the pull-based development model enables community contributors to collaborate more efficiently. However, the distributed and parallel nature of this model creates a risk that developers submit duplicate pull-requests (PRs), which increase the cost of project maintenance. To facilitate further studies that better understand and address the issues introduced by duplicate PRs, we construct a large dataset of historical duplicate PRs extracted from 26 popular open source projects in GitHub using a semi-automatic approach. Furthermore, we present some preliminary applications to illustrate how further research can be conducted based on this dataset.