Crowdsourcing the discovery of software repositories in an educational environment

Yuxing Ma, Tapajit Dey, Jarred M Smith, Nathan Wilder, Audris Mockus
DOI: https://doi.org/10.7287/peerj.preprints.2551v1
2016-10-24
Abstract: In software repository mining, it is important to have a broad representation of projects. In particular, it may be of interest to know what proportion of projects are public. Discovering public projects is easy to parallelize but hard to automate because the data sources vary widely. We evaluate the research and educational potential of crowd-sourcing such research activity in an educational setting. Students were instructed on three ways of discovering projects and were assigned the task of discovering the list of public projects from the top 45 forges, with each student assigned to one forge. Students had to discover as many of the projects as they could using the method of their choice and provide a market-research report for a fictional customer based on the attribute they selected. A subset of the results was sampled and verified for accuracy. We found that many of the public forges do not host public projects, that a substantial fraction of forges do not provide APIs, and that the APIs vary dramatically among the remaining forges. Some forges have been discontinued and others renamed, turning the discovery task into an archaeological exercise. The students' findings raise a number of new research questions and demonstrate the teaching potential of the approach. The accuracy of the results obtained, however, was low, suggesting that crowd-sourcing would require at least two, or more likely a larger number of, investigators per forge, or a better way to gauge investigator skill. We expect that these lessons will be helpful in creating education-sourcing efforts in software data discovery.