Abstract: Data provenance is a valuable tool for detecting and preventing cyber attacks, providing insight into the nature of suspicious events. For example, an administrator can use provenance to identify the perpetrator of a data leak, track an attacker's actions following an intrusion, or even control the flow of outbound data within an organization. Unfortunately, providing relevant data provenance for complex, heterogeneous software deployments is challenging, requiring both tedious instrumentation of many application components and a unified architecture for aggregating information across components. In this work, we present a composition of techniques for bringing affordable and holistic provenance capabilities to complex application workflows, with particular consideration for the exemplar domain of web services. We present DAP, a transparent architecture for capturing detailed data provenance for web service components. Our approach rests on a key insight: minimal knowledge of open protocols is enough to extract precise and efficient provenance information by interposing on application components' communications. This makes DAP compatible with existing web services without requiring instrumentation or developer cooperation. We show how our system can be used in real time to monitor system intrusions or detect data exfiltration attacks while imposing less than 5.1 ms of end-to-end overhead on web requests. Through the introduction of a garbage collection optimization, DAP is able to monitor system activity without suffering from excessive storage overhead. DAP thus serves not only as a provenance-aware web framework, but also as a case study in the non-invasive deployment of provenance capabilities for complex application workflows.
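
To make the interposition idea concrete, the following is a minimal sketch of how provenance edges might be derived from intercepted traffic using only shallow protocol knowledge. It is not DAP's implementation: the `ProvenanceGraph` class, the function names, the edge labels (borrowed from W3C PROV vocabulary), and the regular expression for table names are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ProvenanceGraph:
    """Hypothetical provenance store: a set of (subject, relation, object) edges."""
    edges: set = field(default_factory=set)

    def add(self, subject: str, relation: str, obj: str) -> None:
        self.edges.add((subject, relation, obj))


def observe_http_request(graph: ProvenanceGraph, raw: bytes, client: str) -> str:
    """Record an intercepted HTTP/1.x request using only its request line."""
    request_line = raw.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    method, target, _version = request_line.split(" ", 2)
    activity = f"http:{method} {target}"
    graph.add(activity, "wasAssociatedWith", client)
    return activity


# Assumption: a deliberately shallow pattern that only finds table names
# following FROM or JOIN; a real deployment would need a proper SQL parser.
_TABLE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)


def observe_sql_query(graph: ProvenanceGraph, query: str, activity: str) -> None:
    """Link an intercepted SQL query to the web request that triggered it."""
    for table in _TABLE.findall(query):
        graph.add(activity, "used", f"table:{table}")


graph = ProvenanceGraph()
act = observe_http_request(
    graph,
    b"GET /profile?id=7 HTTP/1.1\r\nHost: example.org\r\n\r\n",
    client="client:10.0.0.5",
)
observe_sql_query(
    graph, "SELECT name FROM users JOIN orders ON users.id = orders.uid", act
)
for edge in sorted(graph.edges):
    print(edge)
```

In this sketch neither the web server nor the database is modified; both observers see only bytes on the wire, which is the property the abstract attributes to interposing on open protocols. How an interposer would correlate a given query with the request that caused it is left out here and is passed in explicitly as `activity`.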
