Improving Software Engineering in Biostatistics: Challenges and Opportunities

Daniel Sabanés Bové, Heidi Seibold, Anne-Laure Boulesteix, Juliane Manitz, Alessandro Gasparini, Burak K. Günhan, Oliver Boix, Armin Schüler, Sven Fillinger, Sven Nahnsen, Anna E. Jacob, Thomas Jaki
DOI: https://doi.org/10.48550/arXiv.2301.11791
2023-01-25
Abstract: Programming is ubiquitous in applied biostatistics; adopting software engineering skills will help biostatisticians do a better job. To explain this, we start by highlighting key challenges for software development and application in biostatistics. Silos between different statistician roles, projects, departments, and organizations lead to the development of duplicate and suboptimal code. Building on top of open-source software requires critical appraisal and risk-based assessment of the used modules. Code that is written needs to be readable to ensure reliable software. The software needs to be easily understandable for the user, as well as developed within testing frameworks to ensure that long term maintenance of the software is feasible. Finally, the reproducibility of research results is hindered by manual analysis workflows and uncontrolled code development. We next describe how the awareness of the importance and application of good software engineering practices and strategies can help address these challenges. The foundation is a better education in basic software engineering skills in schools, universities, and during the work life. Dedicated software engineering teams within academic institutions and companies can be a key factor for the establishment of good software engineering practices and catalyze improvements across research projects. Providing attractive career paths is important for the retainment of talents. Readily available tools can improve the reproducibility of statistical analyses and their use can be exercised in community events. [...]
Computation
What problem does this paper attempt to address?
The paper examines software engineering problems in biostatistics and proposes ways to address them. Specifically, it focuses on the following aspects:

### Main Problems the Paper Attempts to Solve:

1. **Silos**:
   - Isolation between different roles, projects, departments, and organizations leads to duplicated development effort and lower-quality code.
   - For example, in the pharmaceutical industry, the systems and programming languages used for regulatory statistical analysis differ from those used for exploratory analysis, so code has to be rewritten.
2. **Reliability**:
   - As medical practice becomes more data-driven, statistical learning is applied more widely in biomedicine and patient care, which makes software reliability particularly important.
   - Assessing the quality of open-source software is crucial to ensuring accurate results.
3. **Usability**:
   - Open-source projects often lack documentation and intuitive interfaces, making it difficult for other researchers to use these tools.
4. **Maintenance**:
   - Maintaining software is difficult in both academia and industry, especially once the early-career researchers who wrote a package have left.
   - Funding for maintenance is insufficient, and there is little incentive to take over someone else's project.
5. **Reproducibility**:
   - Manual analysis workflows and uncontrolled code development hinder the reproducibility of scientific results.
   - Reproducing results becomes harder as datasets and statistical methods grow more complex.

### Solutions and Opportunities:

1. **Education**:
   - Strengthen education in software engineering skills, including coding practices and version control, in schools, universities, and workplaces.
   - Advocate for programming education starting from elementary school and incorporate software engineering into life sciences curricula.
2. **Establish Dedicated RSE Teams**:
   - Academic institutions and companies should establish dedicated research software engineering (RSE) teams to support researchers and promote sustainable software development.
   - Improve software quality and maintainability through centralized management.
3. **Career Paths**:
   - Provide more attractive career development paths, including leadership positions and other professional directions.
   - Ensure good career prospects for software engineers in both academia and industry.
4. **Reproducibility Tools**:
   - Use specialized software tools (such as renv and Rmageddon) to handle dependency management and environment reproduction.
   - Organize events like ReproHack to build researchers' awareness of reproducibility and their ability to reproduce results.
5. **Reliability Frameworks**:
   - Promote reliability assessment frameworks, such as the risk assessment methods of the R Validation Hub, to ensure the quality of open-source software.

In summary, by describing the software engineering challenges in biostatistics and proposing corresponding solutions, the paper aims to make the field more efficient, reliable, and sustainable.
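
The dependency-management idea behind the reproducibility tools mentioned above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not how renv or Rmageddon are implemented; the function names `snapshot` and `diff_against_lock` are hypothetical. The sketch records the installed version of each package an analysis depends on, then reports any package whose version has drifted from that record.

```python
from importlib import metadata


def snapshot(packages):
    """Return {name: installed version} for the given distribution names.

    A package that is not installed is recorded as None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_against_lock(lock, current):
    """Return (name, locked, current) for each package that differs from the lock."""
    return [
        (name, locked, current.get(name))
        for name, locked in sorted(lock.items())
        if current.get(name) != locked
    ]
```

In practice the snapshot would be written to a file kept under version control next to the analysis code, and the comparison would run before the analysis starts, so that a changed environment is noticed before it can silently change results.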