Coordinated Weighted Sampling for Estimating Aggregates Over Multiple Weight Assignments

Edith Cohen, Haim Kaplan, Subhabrata Sen
DOI: https://doi.org/10.48550/arXiv.0906.4560
2010-11-10
Abstract: Many data sources are naturally modeled by multiple weight assignments over a set of keys: snapshots of an evolving database at multiple points in time, measurements collected over multiple time periods, requests for resources served at multiple locations, and records with multiple numeric attributes. Over such vector-weighted data we are interested in aggregates with respect to one set of weights, such as weighted sums, and aggregates over multiple sets of weights such as the $L_1$ difference. Sample-based summarization is highly effective for data sets that are too large to be stored or manipulated. The summary facilitates approximate processing of queries that may be specified after the summary was generated. Current designs, however, are geared for data sets where a single scalar weight is associated with each key. We develop a sampling framework based on coordinated weighted samples that is suited for multiple weight assignments and obtain estimators that are orders of magnitude tighter than previously possible. We demonstrate the power of our methods through an extensive empirical evaluation on diverse data sets ranging from IP network to stock quotes data.
Databases, Networking and Internet Architecture
What problem does this paper attempt to address?
This paper addresses how to estimate aggregates efficiently over large data sets in which each key carries multiple weight assignments. Such data arises when weights are collected at multiple points in time or at different locations: snapshots of a database at different times, measurements collected over different time periods, requests served at different servers, and so on. For this vector-weighted data, two kinds of aggregates are of interest: those defined over a single weight assignment (such as a weighted sum) and those defined over several assignments at once (such as the $L_1$ difference). Existing sample-based summaries are designed for data in which each key has a single scalar weight, and they are either inapplicable or inaccurate for multiple-assignment aggregates. The paper therefore develops a sampling framework based on coordinated weighted samples that is suited to multiple weight assignments and yields estimators that the abstract describes as orders of magnitude tighter than previously possible. The method is demonstrated through an extensive empirical evaluation on data sets ranging from IP network traffic to stock quotes.

The main contributions of the paper include:

1. **Proposing a sample-based summarization framework for data with multiple weight assignments**: The framework supports efficient summarization of large data sets and approximate aggregate queries that may be specified after the summary is built.
2. **Developing coordinated weighted samples for decentralized and centralized weight models**: The sample for each assignment can be computed separately from the others, which makes the approach scalable.
3. **Providing tight unbiased estimators**: These estimators significantly reduce variance, especially for aggregates that span multiple weight assignments.
4. **Conducting extensive empirical evaluation**: Multiple data sets (IP packet traces, movie ratings, stock quotes) are used to verify the method, showing large improvements over existing methods.

In summary, the paper targets efficient estimation of aggregates over multiple weight assignments in large data sets, and provides a sampling framework and estimators that support flexible query processing while keeping variance low.
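The idea of coordinating samples across weight assignments can be illustrated with a minimal sketch. It assumes one common way to coordinate: every key gets a single hash-derived seed that is reused for all assignments, each assignment keeps the `k` keys with the smallest rank `seed / weight`, and a priority-sampling style adjusted weight `max(w, 1/tau)` estimates a weighted sum. The function names, the `salt` parameter, and the example data are hypothetical, and this is not claimed to be the paper's exact estimators, in particular not its multiple-assignment estimators.

```python
import hashlib


def shared_seed(key, salt="demo"):
    """Map a key to a reproducible pseudo-random seed in (0, 1].

    The same key always gets the same seed, whichever weight
    assignment is being sampled; this is what coordinates the samples.
    """
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)


def bottom_k_sample(weights, k, salt="demo"):
    """Sample one weight assignment (dict key -> weight).

    Returns (sample, tau): sample maps the k smallest-rank keys to
    their weights, and tau is the (k+1)-th smallest rank, or infinity
    when there are at most k keys with positive weight.
    """
    ranked = sorted(
        (shared_seed(key, salt) / w, key)
        for key, w in weights.items()
        if w > 0
    )
    sample = {key: weights[key] for _, key in ranked[:k]}
    tau = ranked[k][0] if len(ranked) > k else float("inf")
    return sample, tau


def estimate_sum(sample, tau, predicate=lambda key: True):
    """Estimate the total weight of keys satisfying predicate.

    Each sampled key contributes its adjusted weight max(w, 1/tau).
    """
    return sum(max(w, 1.0 / tau) for key, w in sample.items() if predicate(key))


# Hypothetical data: request counts per key in two time periods.
period_1 = {"a": 40.0, "b": 5.0, "c": 12.0, "d": 1.0, "e": 30.0}
period_2 = {"a": 42.0, "b": 4.0, "c": 15.0, "d": 2.0, "e": 28.0}

sample_1, tau_1 = bottom_k_sample(period_1, k=3)
sample_2, tau_2 = bottom_k_sample(period_2, k=3)
shared_keys = set(sample_1) & set(sample_2)
```

Because the seed depends only on the key, each period can be sampled independently (even at different locations) and the samples still tend to contain the same keys when the weights are similar. That overlap is what makes aggregates spanning several assignments easier to estimate from the samples.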