LogProv: Logging Events As Provenance of Big Data Analytics Pipelines with Trustworthiness.

Ruoyu Wang, Daniel Sun, Guoqiang Li, Muhammad Atif, Surya Nepal
DOI: https://doi.org/10.1109/bigdata.2016.7840748
2016-01-01
Abstract: Provenance is information about the origin and creation of data. In data science and engineering, such information is useful and sometimes critical. Even so, provenance for big data is under-explored because of the challenges posed by the "Vs" of big data. In data analytics, users need to query history, reproduce intermediate or final results, tune models, and adjust parameters at runtime to make data-driven decisions. In addition, users need to evaluate the trustworthiness of data and pipelines. To realise these functionalities for big data provenance, we propose a solution called LogProv. It requires modifying data pipelines, or even parts of the big data software infrastructure, so that they generate structured logs for pipeline events, and it stores data and logs separately. The data are explicitly linked to the logs, which implicitly record pipeline semantics. Semantic information can be retrieved from the logs easily because the logs are well defined and structured beforehand. We implemented LogProv in Apache Pig and adopted ElasticSearch to provide the query service. In this paper, LogProv is evaluated in a cloud-hosted Hadoop ecosystem and studied empirically through a case study. The results show that LogProv is efficient: the performance overhead is no more than 10%, queries are answered within 1 second, trustworthiness is marked clearly, and there is no impact on the data processing logic of the original pipelines.
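The idea of structured event logs that are stored apart from the data yet explicitly linked to it can be illustrated with a minimal sketch. This is not the LogProv schema or implementation: the record fields (operation, inputs, output, params), the content-hash link (data_id), and the in-memory LogStore standing in for a query service are all assumptions made for illustration only.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


def data_id(payload: bytes) -> str:
    """Hypothetical explicit link: a short content hash identifying a dataset."""
    return hashlib.sha256(payload).hexdigest()[:16]


@dataclass
class PipelineEvent:
    # All field names are hypothetical, not taken from the paper.
    operation: str                  # pipeline step that ran
    inputs: List[str]               # ids of the datasets it consumed
    output: str                     # id of the dataset it produced
    params: Dict[str, str] = field(default_factory=dict)


class LogStore:
    """In-memory stand-in for a log store kept separate from the data."""

    def __init__(self) -> None:
        self._lines: List[str] = []

    def append(self, event: PipelineEvent) -> None:
        # One structured JSON record per pipeline event.
        self._lines.append(json.dumps(asdict(event), sort_keys=True))

    def producer_of(self, output_id: str) -> Optional[dict]:
        # Find the event that produced a given dataset, if any was logged.
        for line in self._lines:
            record = json.loads(line)
            if record["output"] == output_id:
                return record
        return None

    def lineage(self, output_id: str) -> List[str]:
        """Walk back from a result through the operations that produced it."""
        steps: List[str] = []
        frontier = [output_id]
        seen = set()
        while frontier:
            current = frontier.pop()
            if current in seen:
                continue
            seen.add(current)
            record = self.producer_of(current)
            if record is None:
                continue  # source data: no logged producer
            steps.append(record["operation"])
            frontier.extend(record["inputs"])
        return steps


# Toy datasets; in practice these would live in the data store, not the logs.
raw = b"a,1\nb,2\n"
filtered = b"b,2\n"
total = b"2\n"

store = LogStore()
store.append(PipelineEvent("FILTER", [data_id(raw)], data_id(filtered),
                           {"predicate": "value > 1"}))
store.append(PipelineEvent("SUM", [data_id(filtered)], data_id(total)))

print(store.lineage(data_id(total)))
```

Because each record names its inputs and output by id, the pipeline semantics (which step fed which) can be recovered from the logs alone, without touching the data or the processing logic.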
What problem does this paper attempt to address?