Searching Data Lakes for Nested and Joined Data
Yi Zhang, Peter Baile Chen, Zachary G. Ives
DOI: https://doi.org/10.14778/3681954.3682005
IF: 2.5
2024-07-01
Proceedings of the VLDB Endowment
Abstract: Exploratory data science is driving new platforms that assist data scientists with everyday tasks, such as integration and wrangling, to assemble training datasets. Such tools take a scientist's work-in-progress data as a search object (a table or JSON document) and find relevant supplementary data in an organizational data lake, which can be unioned or joined with the current data. Existing data lake search tools find single relational tables to match or join with a search object. Yet many data science applications revolve around hierarchical data, which can only be matched by creating views that simultaneously join and transform several tables in the data lake. In this paper, we extend the Juneau data lake search system [46] to handle this broader class of matches at scale. Our contribution is a general framework for efficiently merging ranked results to match hierarchical data, leveraging novel techniques for indexing and sketching, and incorporating existing single-table search techniques and ranking functions. We experimentally validate our methods' benefits and broad applicability using real data from data science computational notebooks. Our results indicate that, with different ranking functions, our approach can return the optimal set of views up to 4.8x faster than heuristics, with results that are up to 43% more related, and can increase data domain coverage by up to 28%. In a case study showing the utility of our results for downstream data science tasks, we reduce regression error by up to 6.6% and improve classification accuracy by up to 19.5%.
computer science, information systems, theory & methods