Abstract: Query processing in the context of integrating autonomous data sources on the Internet has received significant attention of late. In traditional query processing scenarios, each relation is stored in the same primary database and users expect complete answers. Data integration scenarios, by contrast, involve relations that are stored across multiple, potentially overlapping sources, and they must balance conflicting objectives: how much coverage of the answers users want and how much execution cost they are willing to bear to achieve that coverage. Hence, query processing in data integration requires coverage and overlap statistics about these autonomous sources in order to generate optimal query plans. This dissertation first presents StatMiner, an effective statistics mining approach that automatically generates attribute value hierarchies, discovers frequently accessed query classes, and learns coverage and overlap statistics only with respect to these classes. The dissertation then introduces Multi-R, a multi-objective query optimizer that uses coverage and overlap statistics to support joint optimization of the coverage and cost of query plans. The efficiency of StatMiner and the effectiveness of the learned statistics are demonstrated in the context of BibFinder, a publicly available bibliography mediator developed as a testbed for this work. The empirical evaluation of Multi-R also shows that the generated query plans are significantly better than those produced by existing approaches, both in terms of planning cost and in terms of plan execution cost.
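
To make the role of coverage and overlap statistics concrete, the following is a minimal illustrative sketch, not the StatMiner or Multi-R algorithm. It assumes each source's answers for a query class are available as a set of tuple identifiers, which a real mediator would only estimate. The function names (`coverage`, `overlap`, `greedy_select`) and the toy data are hypothetical. The greedy step picks, at each round, the source that adds the most answers not already covered, which is why overlap matters: a source with high coverage can still contribute little if its answers overlap with sources already chosen.

```python
from typing import Dict, List, Set


def coverage(answers: Set[int], universe: Set[int]) -> float:
    """Fraction of all answers for a query class that one source returns."""
    return len(answers & universe) / len(universe) if universe else 0.0


def overlap(a: Set[int], b: Set[int], universe: Set[int]) -> float:
    """Fraction of all answers that two sources both return."""
    return len(a & b & universe) / len(universe) if universe else 0.0


def greedy_select(sources: Dict[str, Set[int]], universe: Set[int], k: int) -> List[str]:
    """Pick up to k sources, each time taking the one with the largest
    residual coverage (answers not yet covered by earlier picks)."""
    chosen: List[str] = []
    covered: Set[int] = set()
    remaining = dict(sources)
    for _ in range(min(k, len(sources))):
        # Iterate names in sorted order so ties are broken deterministically.
        best = max(
            sorted(remaining),
            key=lambda name: len((remaining[name] & universe) - covered),
        )
        if not (remaining[best] & universe) - covered:
            break  # no remaining source adds new answers
        chosen.append(best)
        covered |= remaining.pop(best) & universe
    return chosen


# Hypothetical toy data: ten answer tuples spread over three sources.
universe = set(range(1, 11))
sources = {
    "A": {1, 2, 3, 4, 5, 6},
    "B": {4, 5, 6, 7},
    "C": {7, 8, 9, 10},
}
plan = greedy_select(sources, universe, k=2)
```

In this toy setting, B and C have the same coverage, but B overlaps heavily with A, so the greedy step prefers C after A. A cost-aware optimizer would additionally weigh each source's execution cost against its residual coverage; that trade-off is left out of this sketch.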

Mining Coverage Statistics for Websource Selection in a Mediator

Mining Source Coverage Statistics for Data Integration

Effectively Mining and Using Coverage and Overlap Statistics for Data Integration

Mining and using coverage and overlap statistics for data integration

Mining, Using and Maintaining Source Statistics for Adaptive Data Integration

Community Mining From Multi-Relational Networks

Joint Use of Multiple Learned Statistics for Improving Online Source Selection

Interactive Rare-Category-of-Interest Mining from Large Datasets

Mining Spread Patterns of Spatio-temporal Co-occurrences over Zones

Spatial Co-Location Pattern Discovery Without Thresholds

Improving Mining Quality by Exploiting Data Dependency

Visual Analysis of User-Driven Association Rule Mining

Mining test oracles of web search engines

Data source selection for information integration in big data era

Mining Succinct and High-Coverage API Usage Patterns from Source Code

Mining Hidden Community in Heterogeneous Social Networks

Joint Optimization of Cost and Coverage of Query Plans in Data Integration

A Learning-Based Approach to Estimate Statistics of Operators in Continuous Queries: a Case Study

Mining Extremely Small Data Sets with Application to Software Reuse

Mining Cohesive Domain Topics from Source Code

"What makes my queries slow?": Subgroup Discovery for SQL Workload Analysis