Abstract: Tracking data lineage is important for data integrity, reproducibility, and debugging data science workflows. However, fine-grained lineage (i.e., at a cell level) is challenging to store, even for the smallest datasets. This paper introduces DSLog, a storage system that efficiently stores, indexes, and queries array data lineage, agnostic to capture methodology. A main contribution is our new compression algorithm, named ProvRC, that compresses captured lineage relationships. Using ProvRC for lineage compression results in a significant storage reduction over functions with simple spatial regularity, beating alternative columnar-store baselines by up to 2000x. We also show that ProvRC facilitates in-situ query processing that allows forward and backward lineage queries without decompression; in the optimal case, it surpasses baselines by 20x in query latency on random numpy pipelines.
What problem does this paper attempt to address?
The paper addresses how to efficiently store and query fine-grained data lineage in data science workflows. Specifically, it focuses on compressing fine-grained lineage losslessly while supporting forward and backward lineage queries directly on the compressed representation, with no decompression step. The challenge comes mainly from the high storage cost of fine-grained lineage (i.e., tracking data relationships at the cell level), which is significant even for relatively small datasets.
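To make the storage cost concrete, here is a minimal sketch of uncompressed cell-level lineage. It is an illustration only, not taken from the paper: the choice of a 1-D cumulative sum and the helper names `cumsum_lineage`, `backward`, and `forward` are assumptions made for this example.

```python
# Illustrative sketch: uncompressed cell-level lineage as (output, input) pairs.
# The operation (a 1-D cumulative sum) and all names here are hypothetical.

def cumsum_lineage(n):
    """Lineage of a 1-D cumulative sum: out[i] depends on in[0..i]."""
    return [(out, inp) for out in range(n) for inp in range(out + 1)]

def backward(pairs, out_cell):
    """Backward query: which input cells contributed to this output cell?"""
    return sorted(inp for out, inp in pairs if out == out_cell)

def forward(pairs, in_cell):
    """Forward query: which output cells does this input cell influence?"""
    return sorted(out for out, inp in pairs if inp == in_cell)

pairs = cumsum_lineage(1000)
print(len(pairs))            # 500500 pairs for a 1000-element array
print(backward(pairs, 3))    # [0, 1, 2, 3]
print(forward(pairs, 997))   # [997, 998, 999]
```

A 1,000-element array already produces about half a million lineage pairs, and each query scans all of them, which is the cost that motivates a compressed, directly queryable representation.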
### Main contributions of the paper:
1. **DSLog system**: The paper introduces a storage system named DSLog, which can efficiently store, index, and query the lineage of array data, regardless of the capture method.
2. **ProvRC compression algorithm**: A new compression algorithm named ProvRC is proposed to compress the captured lineage relationships. For functions with simple spatial regularity, it significantly reduces storage requirements, by up to 2,000 times relative to existing columnar-store baselines.
3. **In-situ query processing**: The paper shows how ProvRC enables in-situ query processing, allowing forward and backward lineage queries without decompression. In the optimal case, query latency is 20 times lower than that of the baseline methods.
### Technical details:
- **Multi-Attribute Range Encoding**: Compresses data by representing lineage relationships as the union of multi-dimensional "ranges".
- **Relative Value Transformation**: Further compresses data by expressing the input attributes relative to the output attributes.
- **In-Situ Query Algorithm**: A specialized θ-join operation executes queries directly on the compressed data without decompression.
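The three ideas above can be sketched in a simplified 1-D form. This is an illustrative toy under stated assumptions, not the paper's ProvRC algorithm: the real system handles multi-dimensional arrays, and the function names (`to_relative`, `range_encode`, `backward_query`, `forward_query`), the row layout `(out_lo, out_hi, delta)`, and the sliding-window example are all hypothetical.

```python
# Toy 1-D sketch of range encoding, relative values, and in-situ queries.
# Every name and the (out_lo, out_hi, delta) row layout is an assumption.

def to_relative(pairs):
    """Relative value transformation: store the input index as an offset
    (delta) from the output index instead of as an absolute value."""
    return [(out, inp - out) for out, inp in pairs]

def range_encode(rel_pairs):
    """Range encoding: merge runs of consecutive output indices that share
    the same delta into a single (out_lo, out_hi, delta) row."""
    rows = []
    for delta, out in sorted({(delta, out) for out, delta in rel_pairs}):
        last = rows[-1] if rows else None
        if last is not None and last[2] == delta and last[1] + 1 == out:
            rows[-1] = (last[0], out, delta)   # extend the current run
        else:
            rows.append((out, out, delta))     # start a new run
    return rows

def backward_query(rows, q_lo, q_hi):
    """Input ranges feeding outputs q_lo..q_hi, computed by range overlap
    on the compressed rows (no expansion back to cell pairs)."""
    hits = []
    for out_lo, out_hi, delta in rows:
        lo, hi = max(out_lo, q_lo), min(out_hi, q_hi)
        if lo <= hi:
            hits.append((lo + delta, hi + delta))
    return sorted(hits)

def forward_query(rows, q_lo, q_hi):
    """Output ranges influenced by inputs q_lo..q_hi, again by overlap."""
    hits = []
    for out_lo, out_hi, delta in rows:
        lo, hi = max(out_lo + delta, q_lo), min(out_hi + delta, q_hi)
        if lo <= hi:
            hits.append((lo - delta, hi - delta))
    return sorted(hits)

# Sliding-window sum of width 3: out[i] depends on in[i], in[i+1], in[i+2].
pairs = [(i, i + k) for i in range(998) for k in range(3)]
rows = range_encode(to_relative(pairs))
print(len(pairs), len(rows))          # 2994 3
print(backward_query(rows, 10, 12))   # [(10, 12), (11, 13), (12, 14)]
print(forward_query(rows, 5, 5))      # [(3, 3), (4, 4), (5, 5)]
```

In this toy case the relative transformation turns a spatially regular dependency into a handful of constant offsets, so 2,994 pairs collapse into 3 rows, and both query directions reduce to interval-overlap tests against those rows. Query results are returned as possibly overlapping ranges; merging them is left out for brevity.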
### Experimental results:
- **Storage efficiency**: For operations with simple spatial regularity, ProvRC reduces storage by approximately 99.7% compared to the original data, and by up to 1,400 times relative to the column-based storage baseline.
- **Query performance**: In the optimal case, a query over 100,000 input cells completes within 1 second, up to 1,500 times faster than the baseline method.
- **Lineage reuse**: The compressed representation supports high-coverage lineage reuse, achieving input-independent reuse for 99 out of 136 evaluated NumPy operations.
### Summary:
By introducing the DSLog system and the ProvRC compression algorithm, the paper addresses the efficiency of fine-grained data lineage storage and querying, significantly reducing storage cost and improving query performance. These techniques are valuable in data science workflows, especially where lineage queries are performed frequently.