DocMine: A Software Documentation-Related Dataset of 950 GitHub Repositories

Akhila Sri Manasa Venigalla, S. Chimalakonda
DOI: https://doi.org/10.1109/MSR59073.2023.00062
2023-05-01
Abstract: Software documentation is a critical aspect of a software project and can support multiple tasks throughout the software development life cycle. There is extensive research on understanding issues and challenges with existing documentation, which is typically available as readme files. In projects that support collaborative development, such as those on GitHub, software artifacts such as commits, pull requests and issues also contain useful information, beyond the conventional readme files, wikis and source code comments, that helps in understanding, using, extending and maintaining the project. However, we are not aware of any dataset that explicitly focuses on documentation-related information in multiple software artifacts such as readme files, commits and pull requests across a repository. To address this gap and to facilitate further research in software documentation, we present DocMine, a dataset of documentation-related information extracted from around 1.35M software artifacts in 950 GitHub repositories spanning four programming languages. The dataset, along with its documentation, is available in CSV and .sql formats at https://doi.org/10.5281/zenodo.5195084.
Computer Science