Abstract: We study the fundamental problem of sampling independent events, called subset sampling. Specifically, consider a set of $n$ events $S=\{x_1, \ldots, x_n\}$, where each event $x_i$ has an associated probability $p(x_i)$. The subset sampling problem aims to sample a subset $T \subseteq S$ such that every $x_i$ is independently included in $T$ with probability $p(x_i)$. A naive solution is to flip a coin for each event, which takes $O(n)$ time. The goal, however, is to develop data structures that allow drawing a sample in time proportional to the expected output size $\mu=\sum_{i=1}^n p(x_i)$, which can be significantly smaller than $n$ in many applications. The subset sampling problem serves as an important building block in many tasks and has been studied for more than a decade. However, most existing subset sampling approaches assume a static setting, in which neither the events in $S$ nor their associated probabilities are allowed to change over time. In a dynamic setting, these algorithms incur either a large query time or a large update time, even though events with time-varying probabilities are ubiquitous in practice. Designing efficient dynamic subset sampling algorithms is therefore a pressing need, yet it remains an open problem. In this paper, we propose ODSS, the first optimal dynamic subset sampling algorithm. The expected query time and update time of ODSS are both optimal, matching the lower bounds of the subset sampling problem. We present a nontrivial theoretical analysis to demonstrate the superiority of ODSS. We also conduct comprehensive experiments to empirically evaluate the performance of ODSS. Moreover, we apply ODSS to a concrete application: influence maximization. We empirically show that ODSS can improve the complexities of existing influence maximization algorithms on large real-world evolving social networks.
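To make the problem statement concrete, the following minimal sketch contrasts the naive coin-flipping approach with a standard geometric-skip sampler. It is not ODSS and is not taken from the paper: the skip sampler covers only the special case in which all events share one probability $p$, where it runs in expected $O(1 + \mu)$ time with $\mu = np$. The function names `naive_subset_sample` and `geometric_skip_sample` are hypothetical and chosen for illustration only.

```python
import math
import random


def naive_subset_sample(probs, rng):
    """Flip one coin per event: O(n) time regardless of the output size.

    probs[i] plays the role of p(x_i); the result holds the selected indices.
    """
    return [i for i, p in enumerate(probs) if rng.random() < p]


def geometric_skip_sample(n, p, rng):
    """Illustrative special case (assumption: every event has the same p).

    Instead of flipping n coins, draw the gap to the next selected index
    from a geometric distribution and jump there directly.
    """
    if p <= 0.0:
        return []
    if p >= 1.0:
        return list(range(n))
    selected = []
    log_q = math.log1p(-p)  # log(1 - p), strictly negative here
    i = -1
    while True:
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
        skip = int(math.log(u) / log_q)  # failures before the next success
        i += skip + 1
        if i >= n:
            return selected
        selected.append(i)
```

A call such as `geometric_skip_sample(1_000_000, 1e-5, random.Random())` touches only the selected indices plus one final overshoot, whereas the naive version always inspects all $n$ events. Handling arbitrary, changing probabilities in optimal time is the harder problem the abstract addresses.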

Perfect and Maximum Randomness in Stratified Sampling over Joins

Random Sampling over Joins Revisited

Sampling over Union of Joins

Reservoir Sampling over Joins

A Simple Algorithm for Worst-Case Optimal Join and Sampling

Join Sampling under Acyclic Degree Constraints and (Cyclic) Subgraph Sampling

Improving Distributed Similarity Join in Metric Space with Error-bounded Sampling

On the implementation of stratified two-stage simple random sampling without replacement, with possible collapsed strata

Optimized stratified sampling for approximate query processing

NOCAP: Near-Optimal Correlation-Aware Partitioning Joins

Optimal Dynamic Subset Sampling: Theory and Applications

Subset Sampling and Its Extensions

Covering the Relational Join

Methods for Combining Probability and Nonprobability Samples Under Unknown Overlaps

Variance-Optimal Offline and Streaming Stratified Random Sampling

A Combinatorial Central Limit Theorem for Stratified Randomization

The Randomness Recycler: A new technique for perfect sampling

Guaranteeing the Õ(AGM/OUT) Runtime for Uniform Sampling and OUT Size Estimation over Joins

Dynamic Sampling Allocation and Design Selection

Random Sampling for Group-By Queries

Sampling with Costs