Abstract: In data mining, estimating the number of distinct values (NDV) is a fundamental problem with various applications. Existing methods for estimating NDV fall into two broad categories: i) scanning-based methods, which scan the entire dataset and maintain a sketch to approximate NDV; and ii) sampling-based methods, which estimate NDV from sampled data rather than accessing the entire data warehouse. Scanning-based methods achieve a lower approximation error at the cost of higher I/O and more time. Sampling-based estimation is preferable in applications with a large data volume and a permissible error bound because it scales better. However, while sampling-based methods are effective on a single machine, they are less practical in a distributed environment with massive data volumes. To obtain the final NDV estimate, the entire sample must be transferred across the distributed system, incurring a prohibitive communication cost when the sample rate is significant. This paper proposes a novel sketch-based distributed method that achieves sub-linear communication costs for distributed sampling-based NDV estimation under mild assumptions. Our method leverages a sketch-based algorithm to estimate the sample's frequency of frequency in the distributed streaming model, which is compatible with most classical sampling-based NDV estimators. Additionally, we provide theoretical evidence for our method's ability to minimize communication costs in the worst-case scenario. Extensive experiments show that our method saves orders of magnitude in communication costs compared to existing sampling- and sketch-based methods.
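To make the term "frequency of frequency" concrete, the following minimal sketch computes it exactly for a small in-memory sample and feeds it to two classical sampling-based estimators, GEE and Chao. It is not the sketch-based distributed algorithm proposed in the paper, which approximates this profile without shipping the whole sample. The function names, the toy sample, the population size of 800, and the fallback used when no value occurs exactly twice are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def frequency_of_frequency(sample):
    """Return f, where f[i] is the number of distinct values seen exactly i times."""
    value_counts = Counter(sample)          # value -> occurrences in the sample
    return Counter(value_counts.values())   # occurrences -> number of such values


def gee_estimate(fof, population_size):
    """GEE estimator: sqrt(N / n) * f_1 + sum of f_i for i >= 2."""
    n = sum(i * f for i, f in fof.items())  # sample size recovered from the profile
    if n == 0:
        return 0.0
    repeated = sum(f for i, f in fof.items() if i >= 2)
    return sqrt(population_size / n) * fof.get(1, 0) + repeated


def chao_estimate(fof):
    """Chao estimator: d + f_1^2 / (2 * f_2), with d distinct values in the sample."""
    d = sum(fof.values())
    f1, f2 = fof.get(1, 0), fof.get(2, 0)
    if f2 == 0:
        # Assumption for this sketch: fall back to the observed distinct count.
        return float(d)
    return d + f1 * f1 / (2 * f2)


# Hypothetical sample of 8 rows drawn from a table assumed to hold 800 rows.
sample = ["a", "b", "a", "c", "d", "d", "d", "e"]
fof = frequency_of_frequency(sample)

print(sorted(fof.items()))     # [(1, 3), (2, 1), (3, 1)]
print(gee_estimate(fof, 800))  # 32.0
print(chao_estimate(fof))      # 9.5
```

Both estimators read only the frequency-of-frequency profile, never the sampled values themselves. That is why an approximation of this profile is enough to drive them.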

Algorithm of Estimating Index Sizes of Resource Collections in Distributed Search

Estimating Collection Size in Distributed Search

A LDA Topic Model Based Collection Selection Method for Distributed Information Retrieval

Estimating collection size with logistic regression.

Research on Indexing Page Collection Selection Method for Search Engine.

Estimation of Distribution Algorithm with Scatter Search for Dynamic Optimisation Problems

Towards A Quality-Oriented Real-Time Web Crawler

An Online Algorithm for Data Collection by Multiple Sinks in Wireless-Sensor Networks

Learning to Rank Collections.

Research on Optimization Algorithm for Deep Web Query

Sampling Content Distributed Over Graphs

A hunger-based scheduling strategy for distributed crawler

Allocating images and selecting image collections for distributed visual search

Large-Scale Duplicate Detection for Web Image Search

Sampling-based Estimation of the Number of Distinct Values in Distributed Environment

Scaling up Copy Detection

A survey of search technologies in Deep Web

Web Search Engine: Characteristics of User Behaviors and Their Implication

DHT-assisted Probabilistic Exhaustive Search in Unstructured P2P Networks

A Survey on Estimation of Distribution Algorithms

An Optimal Sample Allocation Strategy for Partition-Based Random Search.