Abstract: Query processing in the context of integrating autonomous data sources on the Internet has received significant attention of late. In traditional query processing scenarios, each relation is stored in the same primary database and users expect complete answers. Data integration scenarios, in contrast, involve relations that are stored across multiple and potentially overlapping sources, and they must balance conflicting objectives: how much coverage of the answers users want, and how much execution cost they are willing to bear to achieve it. Hence, query processing in data integration requires coverage and overlap statistics about these autonomous sources to generate optimal query plans. This dissertation first presents StatMiner, an effective statistics mining approach which automatically generates attribute value hierarchies, discovers frequently accessed query classes, and learns coverage and overlap statistics only with respect to these classes. The dissertation then introduces Multi-R, a multi-objective query optimizer which uses coverage and overlap statistics to support joint optimization of the coverage and cost of query plans. The efficiency of StatMiner and the effectiveness of the learned statistics are demonstrated in the context of BibFinder, a publicly available bibliography mediator developed as a testbed for this work. The empirical evaluation of Multi-R also shows that the generated query plans are significantly better than those produced by existing approaches, both in terms of planning cost and in terms of plan execution cost.
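To make the role of coverage and overlap statistics concrete, the following is a minimal illustrative sketch, not the StatMiner or Multi-R algorithm. It assumes, purely for illustration, that the answer sets of each source are known exactly (in practice such statistics would be estimated), and the names `coverage`, `overlap`, and `greedy_select` as well as the toy sources `A`, `B`, and `C` are hypothetical. Under those assumptions, coverage is the fraction of all known answers a source returns, overlap is the fraction returned by both of two sources, and a simple greedy selector picks the source that adds the most not-yet-covered answers.

```python
from typing import Dict, List, Set


def coverage(source_answers: Set[str], all_answers: Set[str]) -> float:
    """Fraction of all known answers that a source returns."""
    if not all_answers:
        return 0.0
    return len(source_answers & all_answers) / len(all_answers)


def overlap(a: Set[str], b: Set[str], all_answers: Set[str]) -> float:
    """Fraction of all known answers that both sources return."""
    if not all_answers:
        return 0.0
    return len(a & b & all_answers) / len(all_answers)


def greedy_select(sources: Dict[str, Set[str]], k: int) -> List[str]:
    """Pick up to k sources, each time taking the one with the largest
    residual coverage (answers not already covered by chosen sources).
    This is a generic greedy heuristic used here only for illustration."""
    all_answers: Set[str] = set().union(*sources.values()) if sources else set()
    covered: Set[str] = set()
    chosen: List[str] = []
    for _ in range(min(k, len(sources))):
        best, best_gain = None, 0.0
        for name in sorted(sources):  # sorted for deterministic tie-breaking
            if name in chosen:
                continue
            gain = coverage(sources[name] - covered, all_answers)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # no remaining source adds new answers
            break
        chosen.append(best)
        covered |= sources[best]
    return chosen


# Hypothetical toy data: A and B overlap heavily, C is small but disjoint.
toy_sources = {
    "A": {"p1", "p2", "p3"},
    "B": {"p2", "p3", "p4"},
    "C": {"p5", "p6"},
}
toy_all = set().union(*toy_sources.values())
selected = greedy_select(toy_sources, k=2)
```

In this toy setting, `B` has higher coverage than `C` on its own, yet after `A` is chosen `B` contributes only one new answer while `C` contributes two. This is why overlap statistics, and not coverage alone, matter when choosing which sources to call.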

Mining Source Coverage Statistics for Data Integration

Mining Coverage Statistics for Websource Selection in a Mediator

Effectively Mining and Using Coverage and Overlap Statistics for Data Integration

Mining and using coverage and overlap statistics for data integration

Mining, Using and Maintaining Source Statistics for Adaptive Data Integration

Automatic Accuracy Assessment Via Hashing in Multiple-Source Environment

Joint Use of Multiple Learned Statistics for Improving Online Source Selection

Mining Noise-Tolerant Frequent Closed Itemsets in Very Large Database.

Interactive Rare-Category-of-Interest Mining from Large Datasets

Spatial Co-Location Pattern Discovery Without Thresholds

Mining Regional Co-Location Patterns with Knng

Using Visualization to Improve Clustering Analysis on Heterogeneous Information Network.

Joint Optimization of Cost and Coverage of Query Plans in Data Integration

Mining Succinct and High-Coverage API Usage Patterns from Source Code

Data source selection for information integration in big data era

Improving Mining Quality by Exploiting Data Dependency

Mining Extremely Small Data Sets with Application to Software Reuse

Visual Analysis of User-Driven Association Rule Mining

A Learning-Based Approach to Estimate Statistics of Operators in Continuous Queries: a Case Study.

Software intelligence: the future of mining software engineering data.

Data mining: an overview from a database perspective