Mining, Using and Maintaining Source Statistics for Adaptive Data Integration

Jianchun Fan, Subbarao Kambhampati, Zaiqing Nie
2005-01-01
Abstract: To make query processing effective in data integration scenarios, the mediator needs to gather and use statistics about data sources and to adapt to user preferences that often conflict. We present a framework for effectively mining multiple types of statistics, including source coverage statistics, inter-source overlap statistics, and source latency profiles. Coverage and overlap statistics enable the mediator to optimize the coverage of a query plan, while latency profiles enable it to optimize execution cost. However, users of data integration systems often require query plans that are optimal with respect to multiple objectives, and these objectives often conflict. We present a joint optimization model that uses latency and coverage/overlap statistics simultaneously to support a spectrum of tradeoffs between the coverage and latency requirements of query plans. Moreover, motivated by the dynamic and evolving nature of data integration systems, we introduce an incremental approach for maintaining source statistics. We describe our approaches in detail and present extensive experimental results in the context of Bibfinder, a fielded bibliographic mediator system. Our results demonstrate the effectiveness of our statistics learning and maintenance techniques and of the multi-objective optimization.