Online Event Integration with StoryPivot

Anja Gruenheid, Donald Kossmann, Divesh Srivastava
DOI: https://doi.org/10.48550/arXiv.1610.07732
2016-10-25
Databases
Abstract: Modern data integration systems need to process large amounts of data from a variety of data sources under real-time integration constraints. They are employed not only in enterprises for managing internal data but also in a variety of web services that use techniques such as entity resolution or data cleaning in live systems. In this work, we discuss a new generation of data integration systems that operate on (un-)structured data in an online setting, i.e., systems in which the datasets underlying the integration task are continuously modified. As an example of such a system, we use an online event integration system called StoryPivot. It observes events extracted from news articles in data sources such as the 'Guardian' or the 'Washington Post' and integrates them to show users the evolution of real-world stories over time. The design decisions for StoryPivot are influenced by the trade-off between maintaining high-quality integration results and building a system that processes and integrates events in near real-time. We evaluate our design decisions with experiments on two real-world datasets and generalize our findings to other data integration tasks that have a similar system setup.