R2D2: Reducing Redundancy and Duplication in Data Lakes

Raunak Shah, Koyel Mukherjee, Atharv Tyagi, Sai Keerthana Karnam, Dhruv Joshi, Shivam Bhosale, Subrata Mitra
DOI: https://doi.org/10.1145/3626762
2023-12-21
Abstract: Enterprise data lakes often suffer from substantial amounts of duplicate and redundant data, with data volumes ranging from terabytes to petabytes. This leads to both increased storage costs and unnecessarily high maintenance costs for these datasets. In this work, we focus on identifying and reducing redundancy in enterprise data lakes by addressing the problem of 'dataset containment'. To the best of our knowledge, this is one of the first works that addresses table-level containment at a large scale. We propose R2D2: a three-step hierarchical pipeline that efficiently identifies almost all instances of containment by progressively reducing the search space in the data lake. It first builds (i) a schema containment graph, followed by (ii) statistical min-max pruning, and finally, (iii) content level pruning. We further propose minimizing the total storage and access costs by optimally identifying redundant datasets that can be deleted (and reconstructed on demand) while respecting latency constraints. We implement our system on Azure Databricks clusters using Apache Spark for enterprise data stored in ADLS Gen2, and on AWS clusters for open-source data. In contrast to existing modified baselines that are inaccurate or take several days to run, our pipeline can process an enterprise customer data lake at the TB scale in approximately 5 hours with high accuracy. We present theoretical results as well as extensive empirical validation on both enterprise (scale of TBs) and open-source datasets (scale of MBs to GBs), which showcase the effectiveness of our pipeline.
Databases
What problem does this paper attempt to address?
This paper addresses the large volume of duplicate and redundant data in enterprise data lakes, which raises storage costs and leads to unnecessarily high maintenance costs. It focuses on identifying and reducing this redundancy by solving the "dataset containment" problem. The main contributions of the paper are:

1. **The R2D2 framework**: A hierarchical framework that identifies dataset containment relationships quickly and accurately on terabyte-scale data. It progressively narrows the search space by first constructing a schema containment graph, then applying statistical min-max pruning, and finally applying content-level pruning.
2. **Efficient schema clustering algorithm**: An algorithm that constructs a schema containment graph between datasets, together with a proof that it does not miss any true containment relationship.
3. **Statistical and content-level pruning algorithms**: Pruning based on the min-max values of numerical columns and on content similarity, which progressively builds the dataset containment graph, together with a theoretical analysis of the sampling complexity.
4. **Cost optimization algorithm**: An optimization algorithm that minimizes overall storage and access costs under latency constraints by identifying redundant datasets that can be deleted (and rebuilt on demand).
5. **Extensive empirical validation**: Experiments on enterprise data (terabyte scale) and open-source data (megabyte to gigabyte scale), including comparisons with existing baseline methods.

In summary, this paper aims to help enterprises reduce redundant data in data lakes by efficiently identifying and removing duplicate datasets, thereby lowering storage and maintenance costs.
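To make the three pruning stages concrete, here is a minimal in-memory sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: tables are small dictionaries of non-empty numeric columns, "containment" is taken to mean that the child's columns and rows are a subset of the parent's, and the schema step uses a brute-force pairwise comparison rather than the paper's efficient schema clustering algorithm. All function names (`schema_candidates`, `min_max_prune`, `content_prune`, `find_containment`) are hypothetical.

```python
import random
from typing import Dict, List, Tuple

# Assumption: a table is a dict mapping column name -> list of numeric values,
# with every column non-empty and all columns of a table the same length.
Table = Dict[str, List[float]]
Pair = Tuple[str, str]  # (parent, child): child is a containment candidate of parent


def schema_candidates(tables: Dict[str, Table]) -> List[Pair]:
    """Step (i): keep pairs where the child's columns are a subset of the parent's.

    Brute-force O(n^2) comparison, used here only for illustration.
    """
    pairs = []
    for child, child_cols in tables.items():
        for parent, parent_cols in tables.items():
            if child != parent and set(child_cols) <= set(parent_cols):
                pairs.append((parent, child))
    return pairs


def min_max_prune(tables: Dict[str, Table], pairs: List[Pair]) -> List[Pair]:
    """Step (ii): drop pairs where a child column's range exceeds the parent's."""
    kept = []
    for parent, child in pairs:
        in_range = all(
            min(tables[parent][col]) <= min(tables[child][col])
            and max(tables[child][col]) <= max(tables[parent][col])
            for col in tables[child]
        )
        if in_range:
            kept.append((parent, child))
    return kept


def content_prune(tables: Dict[str, Table], pairs: List[Pair],
                  sample_size: int = 50, seed: int = 0) -> List[Pair]:
    """Step (iii): sample child rows and drop the pair if any sampled row is
    missing from the parent (projected onto the child's columns)."""
    rng = random.Random(seed)
    kept = []
    for parent, child in pairs:
        cols = sorted(tables[child])
        parent_rows = set(zip(*(tables[parent][c] for c in cols)))
        child_rows = list(zip(*(tables[child][c] for c in cols)))
        sample = rng.sample(child_rows, min(sample_size, len(child_rows)))
        if all(row in parent_rows for row in sample):
            kept.append((parent, child))
    return kept


def find_containment(tables: Dict[str, Table]) -> List[Pair]:
    """Chain the three stages, each one shrinking the candidate set."""
    pairs = schema_candidates(tables)
    pairs = min_max_prune(tables, pairs)
    return content_prune(tables, pairs)


# Hypothetical example data.
example = {
    "orders":        {"id": [1, 2, 3, 4], "amount": [10, 20, 30, 40]},
    "orders_q1":     {"id": [1, 2], "amount": [10, 20]},  # true subset of orders
    "orders_2024":   {"id": [1, 9], "amount": [10, 20]},  # id 9 is out of range
    "orders_edited": {"id": [1, 2], "amount": [20, 10]},  # in range, rows differ
    "customers":     {"cust_id": [1, 2]},                 # different schema
}
print(find_containment(example))
```

In this toy data, `customers` never becomes a candidate because of its schema, `orders_2024` is rejected as a child of `orders` by the min-max check, and `orders_edited` survives the min-max check but is rejected at the content level. The ordering reflects the idea of running the cheapest filters first, so that row-level comparison is only needed for the few pairs that remain. Because the last step samples rows when a child has more than `sample_size` rows, a surviving pair is then only probably contained.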