Abstract: This paper studies the \emph{subset sampling} problem. The input is a set $\mathcal{S}$ of $n$ records together with a function $\textbf{p}$ that assigns each record $v\in\mathcal{S}$ a probability $\textbf{p}(v)$. A query returns a random subset $X$ of $\mathcal{S}$, where each record $v\in\mathcal{S}$ is sampled into $X$ independently with probability $\textbf{p}(v)$. The goal is to store $\mathcal{S}$ in a data structure that answers queries efficiently. If $\mathcal{S}$ fits in memory, the problem is interesting when $\mathcal{S}$ is dynamic. We develop a dynamic data structure with $\mathcal{O}(1+\mu_{\mathcal{S}})$ expected \emph{query} time, $\mathcal{O}(n)$ space and $\mathcal{O}(1)$ amortized expected \emph{update}, \emph{insert} and \emph{delete} time, where $\mu_{\mathcal{S}}=\sum_{v\in\mathcal{S}}\textbf{p}(v)$ is the expected size of the sample. The query time and space are optimal. If $\mathcal{S}$ does not fit in memory, the problem is difficult even if $\mathcal{S}$ is static. In this setting, we present an I/O-efficient algorithm that answers a \emph{query} in $\mathcal{O}\left((\log^*_B n)/B+(\mu_\mathcal{S}/B)\log_{M/B} (n/B)\right)$ amortized expected I/Os using $\mathcal{O}(n/B)$ space, where $M$ is the memory size, $B$ is the block size and $\log^*_B n$ is the number of times $\log_2(\cdot)$ must be applied iteratively to $n$ before the result drops below $B$. In addition, when each record is associated with a real-valued key, we extend the \emph{subset sampling} problem to the \emph{range subset sampling} problem, in which the keys of the sampled records must fall within a specified input range $[a,b]$. For this extension, we provide a solution in the dynamic setting, with $\mathcal{O}(\log n+\mu_{\mathcal{S}\cap[a,b]})$ expected \emph{query} time, $\mathcal{O}(n)$ space and $\mathcal{O}(\log n)$ amortized expected \emph{update}, \emph{insert} and \emph{delete} time.
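
To make the query semantics concrete, the following is a minimal Python sketch of the naive baseline: it flips one independent coin per record, so each query costs $\mathcal{O}(n)$ time regardless of $\mu_{\mathcal{S}}$. It is not the data structure developed in the paper; it only illustrates what a query must return. The function names and the dictionary representation of $\textbf{p}$ and of the keys are assumptions made for illustration.

```python
import random


def expected_size(probs):
    """Return mu_S, the sum of the inclusion probabilities p(v)."""
    return sum(probs.values())


def naive_subset_sample(probs, rng=random):
    """Naive O(n) query: include each record v independently with probability probs[v].

    `probs` is a hypothetical dict mapping record -> probability in [0, 1].
    """
    # rng.random() is uniform on [0, 1), so the test succeeds with probability p.
    return {v for v, p in probs.items() if rng.random() < p}


def naive_range_subset_sample(keys, probs, a, b, rng=random):
    """Naive range variant: only records whose key lies in [a, b] may be sampled.

    `keys` is a hypothetical dict mapping record -> real-valued key.
    """
    return {
        v
        for v, p in probs.items()
        if a <= keys[v] <= b and rng.random() < p
    }


if __name__ == "__main__":
    probs = {"u": 0.5, "v": 0.25, "w": 0.25}
    keys = {"u": 1.0, "v": 2.0, "w": 3.0}
    print("mu_S =", expected_size(probs))
    print("sample:", naive_subset_sample(probs))
    print("range sample in [1.5, 3]:", naive_range_subset_sample(keys, probs, 1.5, 3.0))
```

The baseline scans every record even when almost all probabilities are tiny, which is the cost that the $\mathcal{O}(1+\mu_{\mathcal{S}})$ query bound stated in the abstract avoids.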
