Abstract: Data provenance is a valuable tool for detecting and preventing cyber attacks, providing insight into the nature of suspicious events. For example, an administrator can use provenance to identify the perpetrator of a data leak, track an attacker's actions following an intrusion, or even control the flow of outbound data within an organization. Unfortunately, providing relevant data provenance for complex, heterogeneous software deployments is challenging, requiring both the tedious instrumentation of many application components and a unified architecture for aggregating information across components. In this work, we present a composition of techniques for bringing affordable and holistic provenance capabilities to complex application workflows, with particular consideration for the exemplar domain of web services. We present DAP, a transparent architecture for capturing detailed data provenance for web service components. Our approach rests on a key insight: minimal knowledge of open protocols suffices to extract precise and efficient provenance information by interposing on application components' communications, which makes DAP compatible with existing web services without requiring instrumentation or developer cooperation. We show how our system can be used in real time to monitor system intrusions or detect data exfiltration attacks while imposing less than 5.1 ms end-to-end overhead on web requests. Through the introduction of a garbage collection optimization, DAP is able to monitor system activity without suffering from excessive storage overhead. DAP thus serves not only as a provenance-aware web framework, but also as a case study in the non-invasive deployment of provenance capabilities for complex application workflows.
