Formal Definition and Implementation of Reproducibility Tenets for Computational Workflows

Nicholas J. Pritchard, Andreas Wicenec
2024-06-03
Abstract: Computational workflow management systems power contemporary data-intensive sciences. The slowly resolving reproducibility crisis presents both a sobering warning and an opportunity to iterate on what science and data processing entails. The Square Kilometre Array (SKA), the world's largest radio telescope, is among the most extensive scientific projects underway and presents grand scientific collaboration and data-processing challenges. This work presents a scale and system-agnostic computational workflow model and extends five well-known reproducibility tenets into seven defined for our workflow model. Subsequent implementation of these definitions, powered by blockchain primitives, into the Data Activated Flow Graph Engine (DALiuGE), a workflow management system for the SKA, demonstrates the possibility of facilitating automatic formal verification of scientific quality in amortized constant time. We validate our approach with a simple yet representative astronomical processing task: filtering a noisy signal with a lowpass filter with both CPU and GPU methods. Our framework illuminates otherwise obscure scientific discrepancies and similarities between principally identical workflow executions.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The main problem this paper addresses is the **reproducibility of computational workflows**, especially in large-scale scientific projects such as the Square Kilometre Array (SKA) radio telescope. Specifically, the paper aims to:

1. **Propose a scale- and system-independent computational workflow model** to support the discussion of computational reproducibility.
2. **Define seven reproducibility tenets**, extended from five known tenets in the existing literature.
3. **Describe a hash-graph-based workflow signature method (BlockDAGs)**. Built on the scale-independent model, this method allows the reproducibility tenets to be evaluated in amortized constant time.
4. **Implement this signature mechanism in the DALiuGE workflow management system** and verify its effectiveness.
5. **Demonstrate the method on a simple astronomical processing task (a low-pass filter workflow)**, revealing subtle differences between executions that are identical in principle.

### Background and Motivation of the Paper

In recent years, the scientific community has faced a **reproducibility crisis**, which is both a warning and an opportunity to improve scientific research and data processing. As the world's largest radio telescope project, the SKA involves extensive scientific collaboration and data-processing challenges, so scientific reproducibility needs special attention. Radio astronomy depends on computational methods but is essentially an experimental and observational science, and therefore presents unique reproducibility challenges.

### Main Contributions

1. **Propose a scale-independent computational workflow model**: The model applies to computational workflows of different scales and systems and provides a basis for discussing computational reproducibility.
2. **Define seven reproducibility tenets**: Rerun, Repeat, Recompute, Reproduce, Scientific Replication, Computational Replication, and Total Replication. Each tenet defines a testable assertion rather than a series of prescriptive rules.
3. **Hash-graph-based workflow signature method (BlockDAGs)**: The method uses the Merkle tree structure from blockchain technology so that workflow reproducibility can be verified in amortized constant time.
4. **Implement the signature mechanism in the DALiuGE system**: DALiuGE is a workflow management system designed for the SKA. The implementation shows how automatic formal verification of scientific quality can work in practice.
5. **Verify through the low-pass filter workflow**: A noisy signal is filtered with both CPU and GPU methods, demonstrating that the framework can reveal subtle differences between workflows that are identical in principle.

### Formulas and Technical Details

- **Time complexity of Merkle tree construction**: \[ O(|V|\log|V|) \] where \( |V| \) is the amount of provenance information stored in the component.
- **Time complexity of inserting a component into the BlockDAG**: \[ O(d\log d) \] where \( d \) is the degree of the component.
- **Time complexity of constructing the entire BlockDAG**: \[ O(V(D\log D)+E) \] where \( D \) is the average vertex degree, \( V \) is the number of components, and \( E \) is the number of edges between components.
- **Time complexity of final signature construction**: \[ O(l\log l) \] where \( l \) is the number of leaf nodes in the workflow graph.

Through these technical means, the paper shows how to achieve efficient scientific reproducibility verification in large-scale data processing.
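The signature idea summarized above can be illustrated with a minimal Python sketch. It is not the DALiuGE implementation: the function names (`merkle_root`, `workflow_signature`, `select`), the provenance fields (`kind`, `device`), and the three-component workflow are all hypothetical, and the choice of which fields feed the hash is only a stand-in for the paper's tenet definitions. Each component gets a Merkle root over its provenance, that root is chained with the sorted hashes of its parents, and the final signature is a Merkle root over the leaf components.

```python
import hashlib
from graphlib import TopologicalSorter  # Python 3.9+


def sha(data: str) -> str:
    """SHA-256 hex digest of a string."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def merkle_root(items: list[str]) -> str:
    """Merkle root of a list of strings; items are sorted so input order is irrelevant."""
    level = [sha(item) for item in sorted(items)]
    if not level:
        return sha("")
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd-sized levels
        level = [sha(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def workflow_signature(provenance: dict[str, dict[str, str]],
                       parents: dict[str, list[str]]) -> str:
    """Hash every component together with its parents, then hash the leaf components."""
    graph = {name: parents.get(name, []) for name in provenance}
    block_hash: dict[str, str] = {}
    for name in TopologicalSorter(graph).static_order():  # parents come first
        own = merkle_root([f"{k}={v}" for k, v in provenance[name].items()])
        parent_hashes = sorted(block_hash[p] for p in graph[name])
        block_hash[name] = sha(own + "".join(parent_hashes))
    has_child = {p for ps in graph.values() for p in ps}
    leaves = [block_hash[n] for n in provenance if n not in has_child]
    return merkle_root(leaves)


def select(provenance: dict[str, dict[str, str]], fields: set[str]) -> dict[str, dict[str, str]]:
    """Keep only the provenance fields that a given (hypothetical) tenet cares about."""
    return {n: {k: v for k, v in kv.items() if k in fields} for n, kv in provenance.items()}


# Hypothetical low-pass filter workflow: signal -> filter -> output.
parents = {"signal": [], "filter": ["signal"], "output": ["filter"]}
cpu_run = {
    "signal": {"kind": "noisy-signal"},
    "filter": {"kind": "lowpass", "device": "cpu"},
    "output": {"kind": "file"},
}
gpu_run = {
    "signal": {"kind": "noisy-signal"},
    "filter": {"kind": "lowpass", "device": "gpu"},
    "output": {"kind": "file"},
}

same_logic = (workflow_signature(select(cpu_run, {"kind"}), parents)
              == workflow_signature(select(gpu_run, {"kind"}), parents))
same_everything = (workflow_signature(cpu_run, parents)
                   == workflow_signature(gpu_run, parents))
print(same_logic, same_everything)  # True False
```

Once the signatures exist, comparing two executions is a single comparison of fixed-length digests, independent of workflow size. Sorting the parent hashes per component mirrors the \( O(d\log d) \) insertion cost listed above.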