A partition-based approach to support streaming updates over persistent data in an active datawarehouse

Abhirup Chakraborty, Ajit Singh
DOI: https://doi.org/10.1109/ipdps.2009.5161064
2009-05-01
Abstract: Active warehousing has emerged to meet high user demand for fresh, up-to-date information. Online refreshment of source updates introduces processing and disk overheads in the implementation of the warehouse transformations. This paper considers a frequently occurring operator in active warehousing that computes the join between a fast, time-varying or bursty update stream $S$ and a persistent disk relation $R$, using limited memory. Such a join operation is the crux of a number of common transformations (e.g., surrogate key assignment, duplicate detection, etc.) in an active data warehouse. We propose a partition-based join algorithm that minimizes the processing overhead, the disk overhead, and the delay in output tuples. The proposed algorithm exploits the spatio-temporal locality within the update stream and improves the delay in output tuples by exploiting hot-spots in the range or domain of the joining attributes. At the same time, it shares the I/O cost of accessing disk data of relation $R$ over a volume of tuples from update stream $S$. We present experimental results showing the effectiveness of the proposed algorithm.
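To make the I/O-sharing idea concrete, the following is a minimal sketch and not the paper's algorithm. It assumes integer join keys, a modulo partitioning scheme, in-memory dictionaries standing in for the disk partitions of $R$, and a fixed flush threshold; all names (`build_partitions`, `PartitionedStreamJoin`, `flush_threshold`) are hypothetical. Stream tuples are buffered per partition, and a partition of $R$ is read once for a whole batch of buffered tuples. The hot-spot handling described in the abstract is not modeled here.

```python
from collections import defaultdict


def partition_of(key, num_partitions):
    """Map an integer join key to a partition (assumed modulo scheme)."""
    return key % num_partitions


def build_partitions(rows, num_partitions):
    """Group rows (key, value) of R into partitions; each dict stands in for a disk partition."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in rows:
        partitions[partition_of(key, num_partitions)][key].append(value)
    return partitions


class PartitionedStreamJoin:
    """Buffers stream tuples per partition and probes R one partition at a time."""

    def __init__(self, relation_partitions, flush_threshold):
        self.relation_partitions = relation_partitions
        self.flush_threshold = flush_threshold  # assumed per-partition batch size
        self.buffers = defaultdict(list)        # partition id -> pending S tuples
        self.partition_reads = 0                # counts simulated disk reads

    def insert(self, key, payload):
        """Buffer one S tuple; join its partition's batch once the buffer is full."""
        p = partition_of(key, len(self.relation_partitions))
        self.buffers[p].append((key, payload))
        if len(self.buffers[p]) >= self.flush_threshold:
            return self._probe(p)
        return []

    def _probe(self, p):
        """Read partition p of R once and join every buffered S tuple against it."""
        r_part = self.relation_partitions[p]
        self.partition_reads += 1
        batch = self.buffers.pop(p, [])
        return [(key, payload, r_value)
                for key, payload in batch
                for r_value in r_part.get(key, [])]

    def flush_all(self):
        """Join all remaining buffered tuples, one read per non-empty partition."""
        results = []
        for p in list(self.buffers):
            results.extend(self._probe(p))
        return results


if __name__ == "__main__":
    r_rows = [(1, "a"), (2, "b"), (3, "c"), (5, "e")]
    join = PartitionedStreamJoin(build_partitions(r_rows, 2), flush_threshold=3)
    for key, payload in [(1, "s1"), (3, "s2"), (2, "s3"), (1, "s4")]:
        for row in join.insert(key, payload):
            print(row)
    for row in join.flush_all():
        print(row)
    print("partition reads:", join.partition_reads)
```

In this sketch, a larger `flush_threshold` amortizes each partition read over more stream tuples but makes buffered tuples wait longer for their output, which is the overhead-versus-delay trade-off the abstract refers to.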