Ten simple rules for training scientists to make better software
Kit Gallagher,Richard Creswell,Ben Lambert,Martin Robinson,Chon Lok Lei,Gary R. Mirams,David J. Gavaghan
DOI: https://doi.org/10.1371/journal.pcbi.1012410
2024-09-13
PLoS Computational Biology
Abstract: Computational methods and associated software implementations are central to every field of scientific investigation. Modern biological research relies heavily on the development of software tools to process and organize increasingly large data sets, simulate complex mechanistic models, provide tools for the analysis and management of data, and visualize and organize outputs [1–3]. Such software varies widely in its scope, complexity, and potential for re-use, from single-use analysis scripts that accompany journal publications to domain-specific packages (such as molecular dynamics simulators [4,5]), common numerical methods (such as finite element methods [6] or optimization algorithms [7]), and finally fundamental scientific software (such as the numerical methods package "numpy" [8]). For valid usage in research, it is essential that this software is openly available and accurately implements its intended functionality. Accessibility of code has improved significantly in recent years [9], and it is increasingly accepted that research papers should be accompanied by accurate code scripts, which are subject to peer review alongside the other methods of the research. However, this has simultaneously highlighted the role of computational science in the so-called "reproducibility crisis" [10], where multiple cross-disciplinary meta-analyses have indicated that less than half of published code may be run without errors [11–14], and as little as 5% can replicate the primary results of the associated paper [15]. This causes a multitude of negative effects on scientific research, including a lack of transparency and open access [16], poor development and deployment practices [17], and a lack of executable reproducibility, where code cannot even be run [18]. This also undermines the productivity of the research software base, as any researchers wishing to use the same computational framework are then forced to re-implement it in their own software.
Beyond basic reproducibility, higher-quality software possesses additional qualities such as extensibility, reliability, and reusability. These characteristics arise from carefully designed, well-documented, and appropriately maintained code, and they enable research software to more thoroughly and efficiently support scientific progress, for example by allowing a software package developed by one research group to be picked up by another, which goes on to add features to address further scientific questions. The term "sustainable" (not to be confused with environmentally friendly software) has been adopted to refer to software that is reliable, reproducible, and reusable [19]. Given the importance of high-quality software to effective research in computational biology, there has been significant literature on ensuring reproducibility [20,21] and good development practices [22–24] in computational research. Indeed, several other Ten Simple Rules articles have already provided excellent descriptions of best software development practices to aspire towards, and we point the reader to these guides on documentation [25], usability [24], robustness [26], and version control [27]. However, less attention has been devoted to specific teaching strategies that are effective at nurturing in researchers the complex skillset required to produce the high-quality software that underpins both academic and industrial biomedical research. Biologists and computational researchers, even if aware of the importance of high-quality software to their research, are typically left to fend for themselves in developing the necessary skills to produce reusable software. Although training resources are available (for example, courses offered by the Software Sustainability Institute to UK-based researchers), many doctoral training programs overlook extensive formal education in effective software engineering.
Two recent articles in the Ten Simple Rules collection [28,29] have discussed the teaching of foundational computer science and coding techniques to biology students. We advance this discussion by describing the specific steps for effectively teaching the necessary skills a scientist needs to develop sustainable software packages that are fit for (re-)use in academic research or more widely. We advocate that future researchers receive extended training in software engineering, moving beyond few-day training sessions and forming a substantial and integrated portion of their scientific education. Although our advice is likely to be applicable to all students and researchers hoping to improve their software development skills, our guidelines are directed towards an audience of students who have some programming literacy but little formal training in software engineering, typical of early doctoral students. These practices are also applicable outside of doctoral training environments, and we believe they should form a key part of postgraduat -Abstract Truncated-
biochemical research methods,mathematical & computational biology