Abstract: Data lakes are massive repositories of raw and heterogeneous data, designed to meet the requirements of modern data storage. Nonetheless, this same philosophy increases the complexity of performing discovery tasks to find relevant data for subsequent processing. As a response to these growing challenges, we present FREYJA, a modern data discovery system capable of effectively exploring data lakes, aimed at finding candidates to perform joins and increase the number of attributes for downstream tasks. More precisely, we want to compute rankings that sort potential joins by their relevance. Modern mechanisms apply advanced table representation learning (TRL) techniques to yield accurate joins. Yet, this incurs high computational costs when dealing with elevated volumes of data. In contrast to the state-of-the-art, we adopt a novel notion of join quality tailored to data lakes, which leverages syntactic measurements while achieving accuracy comparable to that of TRL approaches. To obtain this metric in a scalable manner we train a general purpose predictive model. Predictions are based, rather than on large-scale datasets, on data profiles, succinct representations that capture the underlying characteristics of the data. Our experiments show that our system, FREYJA, matches the results of the state-of-the-art whilst reducing the execution times by several orders of magnitude.

What problem does this paper attempt to address?

This paper addresses the problem of efficiently discovering and ranking potential joins in data lake environments. The authors propose a system named FREYJA, which aims to improve both the accuracy and the efficiency of join discovery by introducing a novel join-quality measure. The problem is broken down in detail below.

### 1. **Background and Challenges**

- **Characteristics of Data Lakes**: Data lakes store large amounts of unprocessed, heterogeneous data, which makes it hard to discover meaningful joins among them.
- **Limitations of Existing Methods**:
  - **Semantic Methods**: They are accurate but computationally expensive, especially when fine-tuning pre-trained models on large-scale datasets.
  - **Syntactic Methods**: They are computationally efficient but prone to generating many false positives because of data heterogeneity.

### 2. **Research Objectives**

- **Propose the FREYJA System**: The system aims to combine the advantages of syntactic and semantic methods to compute join quality efficiently and accurately.
- **Introduce a New Join-Quality Metric**: The metric considers the multiset Jaccard index and also introduces the cardinality proportion to better capture the semantic relevance of the data.

### 3. **Specific Problems**

- **How to efficiently discover potential joins in data lakes?**
- **How to reduce computational costs while ensuring accuracy?**

### 4. **Solutions**

- **New Join-Quality Metrics**:
  - **Multiset Jaccard Index**: Measures the degree of overlap between two multisets of values.
    \[ J(A, B)=\frac{|A \cap B|}{|A|+|B|} \]
  - **Cardinality Proportion**: Measures the cardinality difference between the two sides of the join to capture semantic relevance.
    \[ K(A, B)=\frac{\min(|A|,|B|)}{\max(|A|,|B|)} \]
- **Prediction Model**: A predictive model is trained on data profiles rather than on the original large-scale datasets, so that join quality can be estimated quickly.

### 5. **Experimental Verification**

- **Experimental Results**: FREYJA maintains an accuracy comparable to that of the best existing methods while reducing execution times by several orders of magnitude.

### Summary

The main contribution of this paper is a new join-quality measure, together with the FREYJA system that uses it to achieve efficient and accurate join discovery. The approach improves join discovery performance in data lakes and reduces computational costs, which makes it better suited to large-scale data processing tasks.
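The two measures listed under Solutions can be illustrated with a small Python sketch. It is not FREYJA's implementation: the function names and sample columns are hypothetical, and it assumes that the multiset Jaccard index counts repeated values with their multiplicity while the cardinality proportion compares the number of distinct values in each column.

```python
from collections import Counter
from typing import Hashable, Sequence


def multiset_jaccard(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """J(A, B) = |A ∩ B| / (|A| + |B|), treating both columns as multisets.

    With this denominator the value ranges from 0 to 0.5.
    """
    ca, cb = Counter(a), Counter(b)
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        return 0.0
    # Counter & Counter keeps the minimum multiplicity of each shared value.
    intersection = sum((ca & cb).values())
    return intersection / total


def cardinality_proportion(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """K(A, B) = min(|A|, |B|) / max(|A|, |B|).

    Assumption: cardinality means the number of distinct values.
    """
    na, nb = len(set(a)), len(set(b))
    if max(na, nb) == 0:
        return 0.0
    return min(na, nb) / max(na, nb)


# Hypothetical columns from two tables in a data lake.
orders_country = ["ES", "ES", "FR", "DE"]
country_codes = ["ES", "FR"]

print(multiset_jaccard(orders_country, country_codes))
print(cardinality_proportion(orders_country, country_codes))
```

For the sample columns, the multiset intersection holds two values out of six in total, and the distinct-value counts are three and two. Computing both measures exactly requires scanning the full columns, which is the cost that predicting join quality from data profiles is meant to avoid.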

FREYJA: Efficient Join Discovery in Data Lakes
