Detecting Duplicate Pull-requests in GitHub

Zhixing Li, Gang Yin, Yue Yu, Tao Wang, Huaimin Wang
DOI: https://doi.org/10.1145/3131704.3131725
2017-01-01
Abstract: The widespread use of pull-requests boosts the development and evolution of many open source software projects. However, due to the parallel and uncoordinated nature of the development process in GitHub, different contributors may submit duplicate pull-requests that solve the same problem. Duplicate pull-requests increase the maintenance cost of GitHub, waste time on redundant code review, and can even frustrate developers' willingness to keep contributing. In this paper, we investigate using textual information to automatically detect duplicate pull-requests in GitHub. For a newly arriving pull-request, we compute its textual similarity to the existing pull-requests and return a candidate list of the most similar ones. We evaluate our approach on three popular projects hosted on GitHub, namely Rails, Elasticsearch and Angular.JS. The evaluation shows that about 55.3% to 71.0% of the duplicates can be found when we use the combination of title similarity and description similarity.
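The ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which similarity measure or weighting is used, so everything below is an assumption made for illustration: TF-IDF vectors with cosine similarity, an unweighted average of title and description similarity, and the hypothetical names `rank_candidates`, `tfidf_vectors`, and the `number`/`title`/`description` record fields. It should not be read as the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight dict) per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF (an assumption) so common terms keep a small positive weight.
    idf = {t: math.log((1 + n) / (1 + f)) + 1.0 for t, f in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(new_pr, existing_prs, top_k=3):
    """Return (number, score) pairs for the top_k most similar existing PRs.

    The score is the plain average of title and description similarity;
    this equal weighting is an illustrative assumption.
    """
    prs = existing_prs + [new_pr]
    title_vecs = tfidf_vectors([p["title"] for p in prs])
    desc_vecs = tfidf_vectors([p["description"] for p in prs])
    scored = []
    for i, pr in enumerate(existing_prs):
        title_sim = cosine(title_vecs[-1], title_vecs[i])
        desc_sim = cosine(desc_vecs[-1], desc_vecs[i])
        scored.append((pr["number"], (title_sim + desc_sim) / 2))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

Calling `rank_candidates(new_pr, existing_prs, top_k=5)` with dictionaries holding a pull-request number, title, and description returns the five highest-scoring candidates for a reviewer to inspect. Averaging the two similarities is only one way to combine them; a weighted sum would be an equally plausible choice.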