Abstract: In complex data analyses it is increasingly important to capture information about how data sets are used, in addition to preserving them over time, in order to ensure reproducibility of results, to verify the work of others and to ensure that data have been used under appropriate conditions for specific analyses. Scientific workflow-based studies are beginning to realize the benefit of capturing this provenance of data and of the activities used to process, transform and carry out studies on those data. One way to support the development of workflows and their use in (collaborative) biomedical analyses is through a Virtual Research Environment (VRE). The dynamic and distributed nature of Grid/Cloud computing, however, makes the capture and processing of provenance information a major research challenge. Furthermore, most workflow provenance management services are designed only for data-flow oriented workflows, and researchers are now realizing that tracking data or workflows alone or separately is insufficient to support the scientific process. What is required for collaborative research is traceable and reproducible provenance support in a fully orchestrated VRE that enables researchers to define their studies in terms of the datasets and processes used, to monitor and visualize the outcome of their analyses and to log their results so that other users can call upon that acquired knowledge to support subsequent studies. We have extended the work carried out in the neuGRID and N4U projects on providing a so-called Virtual Laboratory to provide the foundation for a generic VRE in which sets of biomedical data (images, laboratory test results, patient records, epidemiological analyses etc.) and the workflows (pipelines) used to process those data, together with their provenance data and result sets, are captured in the CRISTAL software.
