Self-supervised data lakes discovery through unsupervised metadata-driven weighted similarity

I. Made Putrama, Peter Martinek
DOI: https://doi.org/10.1016/j.ins.2024.120242
IF: 8.1
2024-02-02
Information Sciences
Abstract: Data engineers invest significant effort in the early stages of data analysis, including identifying relevant datasets in large and complex data lakes. Data retrieval efficiency becomes an urgent concern as data volumes continue to increase with structures that vary in size and complexity. This paper presents a novel strategy for accelerating data discovery in data lakes. Our approach integrates self-supervised techniques and weighted similarity estimation for efficient dataset classification, facilitating faster search and retrieval. By extracting meta-feature characteristics, our approach improves data traceability in data lakes through clustering, resulting in significant improvements in modularity, showing an efficiency gain of 69.8%. Regarding dataset search through classification, it consistently achieves AUC-ROC scores exceeding 0.85, indicating strong performance in class differentiation. Finally, our method reduces the overall execution time more than twofold and shows promising applications for addressing real-world challenges in various data lake domains.
computer science, information systems
What problem does this paper attempt to address?