Alaska: A Flexible Benchmark for Data Integration Tasks

Valter Crescenzi, Andrea De Angelis, Donatella Firmani, Maurizio Mazzei, Paolo Merialdo, Federico Piai, Divesh Srivastava
DOI: https://doi.org/10.48550/arXiv.2101.11259
2021-02-03
Abstract: Data integration is a long-standing interest of the data management community and has many disparate applications, including business, science and government. We have recently witnessed impressive results in specific data integration tasks, such as Entity Resolution, thanks to the increasing availability of benchmarks. A limitation of such benchmarks is that they typically come with their own task definition and it can be difficult to leverage them for complex integration pipelines. As a result, evaluating end-to-end pipelines for the entire data integration process is still an elusive goal. In this work, we present Alaska, the first benchmark based on real-world dataset to support seamlessly multiple tasks (and their variants) of the data integration pipeline. The dataset consists of ~70k heterogeneous product specifications from 71 e-commerce websites with thousands of different product attributes. Our benchmark comes with profiling meta-data, a set of pre-defined use cases with diverse characteristics, and an extensive manually curated ground truth. We demonstrate the flexibility of our benchmark by focusing on several variants of two crucial data integration tasks, Schema Matching and Entity Resolution. Our experiments show that our benchmark enables the evaluation of a variety of methods that previously were difficult to compare, and can foster the design of more holistic data integration solutions.
Databases
What problem does this paper attempt to address?
This paper addresses a gap in data integration benchmarking: existing benchmarks usually target a single task (such as entity resolution or schema matching) and are hard to use for evaluating complex, end-to-end data integration processes. Specifically, the paper points out two limitations of current benchmarks:

1. **Limitations of task definitions**: Existing benchmarks typically come with their own task definitions, which makes them difficult to use in complex data integration pipelines.
2. **Difficulties in evaluating end-to-end pipelines**: Without a comprehensive benchmark, evaluating an end-to-end pipeline for the entire data integration process remains an elusive goal.

To overcome these limitations, the paper proposes **Alaska**, a benchmark built on a real-world dataset that aims to seamlessly support multiple tasks of the data integration pipeline and their variants. Alaska contains approximately 70,000 heterogeneous product specifications from 71 e-commerce websites, covering three product categories: cameras, monitors, and laptops. The benchmark provides profiling metadata, predefined use cases, and an extensive manually annotated ground truth.

### Main contributions

1. **Support for multiple tasks**: Alaska supports multiple data integration tasks, including Schema Matching (SM) and Entity Resolution (ER), and can be easily extended to support other tasks (such as data extraction).
2. **Data heterogeneity**: Alaska contains data sources with different characteristics, from small, clean sources to large, dirty ones, covering a variety of record and attribute representations.
3. **Manually annotated ground truth**: Alaska provides a ground truth manually annotated by domain experts. The ground truth is large and spans multiple data sources and record attributes, which makes it suitable for evaluating high-precision methods.

### Specific problems solved

- **Schema Matching (SM)**:
  - **Catalog SM**: Given a set of data sources and a catalog source, find the correspondences between the attributes of the data sources and the attributes of the catalog source.
  - **Mediated SM**: Given a set of data sources and a manually defined mediated schema, find the correspondences between the attributes of the data sources and the attributes of the mediated schema.
- **Entity Resolution (ER)**:
  - **Similarity-join ER**: Given two data sources, find pairs of records that refer to the same entity.
  - **Self-join ER**: Given a set of data sources, find pairs of records that refer to the same entity.
  - **Schema-agnostic ER**: Perform self-join ER without schema-matching information.

### Challenges

- **Synonyms**: The dataset contains attributes with different names that refer to the same property.
- **Homonyms**: The dataset contains attributes with the same name that refer to different properties.
- **Granularity differences**: The dataset contains one-to-one, one-to-many, and many-to-many attribute correspondences.
- **Diversity**: Different data sources may use different formats and naming conventions.
- **Noise**: Records may contain noisy values caused by errors in the original web pages or in the data extraction process.
- **Skewed distribution**: Entity cluster sizes are unevenly distributed across the data sources, with some entities over-represented and others under-represented.

By providing a flexible and comprehensive benchmark, Alaska aims to promote the design and evaluation of more holistic data integration solutions.
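
To make the self-join ER task above more concrete, here is a minimal Python sketch of how a method's output could be scored against a ground truth of entity clusters. It rests on several assumptions that are not taken from the paper: the record identifiers and entity labels are invented toy values, the ground truth is modeled as a simple record-to-entity mapping, and pairwise precision, recall, and F1 are used as a common choice of ER metrics rather than as the benchmark's stated evaluation protocol.

```python
from collections import defaultdict
from itertools import combinations


def matching_pairs(entity_of):
    """Return the set of unordered record pairs that share an entity id."""
    clusters = defaultdict(list)
    for record_id, entity_id in entity_of.items():
        clusters[entity_id].append(record_id)
    pairs = set()
    for records in clusters.values():
        # Sorting gives each pair one canonical (smaller, larger) form.
        for a, b in combinations(sorted(records), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(predicted, truth):
    """Pairwise precision, recall, and F1 of predicted pairs vs. true pairs."""
    predicted = {tuple(sorted(pair)) for pair in predicted}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy ground truth: record id -> entity id (not real Alaska data).
# Cluster sizes are deliberately uneven (3, 2, and 1 records).
ground_truth = {
    "siteA//1": "camera_x",
    "siteB//7": "camera_x",
    "siteC//3": "camera_x",
    "siteA//2": "camera_y",
    "siteB//9": "camera_y",
    "siteC//4": "camera_z",
}

truth_pairs = matching_pairs(ground_truth)

# Hypothetical output of some ER method under evaluation.
predicted_pairs = [
    ("siteB//7", "siteA//1"),  # correct match
    ("siteA//2", "siteB//9"),  # correct match
    ("siteC//4", "siteA//2"),  # wrong: the records describe different entities
]

precision, recall, f1 = pairwise_scores(predicted_pairs, truth_pairs)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The number of true pairs grows quadratically with cluster size, so in this pairwise view a few over-represented entities can dominate the score. This is one reason the skewed cluster-size distribution listed among the challenges matters when comparing methods.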