Abstract: We study the query version of the approximate heavy hitter and quantile problems. In the former problem, the input is a parameter $\varepsilon$ and a set $P$ of $n$ points in $\mathbb{R}^d$ where each point is assigned a color from a set $C$, and we want to build a structure such that, given any geometric range $\gamma$, we can efficiently find a list of approximate heavy hitters in $\gamma\cap P$, i.e., colors that appear at least $\varepsilon |\gamma \cap P|$ times in $\gamma \cap P$, as well as their frequencies with an additive error of $\varepsilon |\gamma \cap P|$. In the latter problem, each point is assigned a weight from a totally ordered universe and the query must output a sequence $S$ of $1+1/\varepsilon$ weights such that the $i$-th weight in $S$ has approximate rank $i\varepsilon|\gamma\cap P|$, that is, rank $i\varepsilon|\gamma\cap P|$ up to an additive error of $\varepsilon|\gamma\cap P|$. Previously, optimal results were known only in 1D [WY11], while a few sub-optimal methods were available in higher dimensions [AW17, ACH+12]. We study the problems for 3D halfspace and dominance queries. We consider the real RAM model with integer registers of size $w=\Theta(\log n)$ bits. For dominance queries, we show optimal solutions for both the heavy hitter and quantile problems: using linear space, we can answer both queries in time $O(\log n + 1/\varepsilon)$. Note that as the output size is $\frac{1}{\varepsilon}$, after investing the initial $O(\log n)$ searching time, our structure takes on average $O(1)$ time to find a heavy hitter or a quantile. For more general halfspace heavy hitter queries, the same optimal query time can be achieved by increasing the space by an extra $\log_w\frac{1}{\varepsilon}$ (resp. $\log\log_w\frac{1}{\varepsilon}$) factor in 3D (resp. 2D). By spending extra $\log^{O(1)}\frac{1}{\varepsilon}$ factors in time and space, we can also support quantile queries.
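To make the two query definitions concrete, the following is a minimal brute-force sketch for 3D dominance ranges. It is not the data structure from the paper: it scans all points per query (linear time or worse) and returns exact answers, which trivially satisfy the $\varepsilon|\gamma\cap P|$ additive-error guarantee. The function names `dominated`, `heavy_hitters`, and `quantiles`, and the choice to clamp ranks to valid indices, are assumptions made for illustration only.

```python
from collections import Counter
from math import ceil


def dominated(p, q):
    """True if point p lies in the dominance range of q (coordinate-wise <=)."""
    return all(a <= b for a, b in zip(p, q))


def heavy_hitters(points, colors, q, eps):
    """Naive scan: colors appearing at least eps * m times among the
    m points dominated by q, together with their exact frequencies."""
    in_range = [c for p, c in zip(points, colors) if dominated(p, q)]
    m = len(in_range)
    counts = Counter(in_range)
    return {c: k for c, k in counts.items() if k >= eps * m}


def quantiles(points, weights, q, eps):
    """Naive scan: a sequence of 1 + 1/eps weights whose i-th entry has
    rank about i * eps * m among the m points dominated by q."""
    in_range = sorted(w for p, w in zip(points, weights) if dominated(p, q))
    m = len(in_range)
    if m == 0:
        return []
    steps = round(1 / eps)
    # Ranks are treated as 1-based and clamped to [1, m] (an assumption).
    return [
        in_range[min(m - 1, max(0, ceil(i * eps * m) - 1))]
        for i in range(steps + 1)
    ]
```

A baseline like this is mainly useful as a correctness oracle when testing a faster structure: any answer whose frequencies or ranks differ from the exact ones by at most $\varepsilon|\gamma\cap P|$ is acceptable under the definitions above.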

Computing Data Distribution from Query Selectivities

Distributed Privacy-Aware Fast Selection Algorithm for Large-Scale Data

Range (Rényi) Entropy Queries and Partitioning

Consistent and Flexible Selectivity Estimation for High-Dimensional Data

Non-Stochastic CDF Estimation Using Threshold Queries

The Optimization Of The Range-Count Queries In Differential Privacy

Selective Inference with Distributed Data

Selectivity Estimation of Inequality Joins In Databases

Effective Data Distribution And Reallocation Strategies For Fast Query Response In Distributed Query-Intensive Data Environments

Distribution Privacy Under Function Recoverability

Exploring Distributional Discrepancy for Multidimensional Point Set Retrieval

Query Optimisation As Part of Distribution Design for Complex Value Databases

RelJoin: Relative-cost-based Selection of Distributed Join Methods for Query Plan Optimization

On Range Summary Queries

Tailoring data source distributions for fairness-aware data integration

Communication-efficient Estimation for Distributed Subset Selection

A Practical Theory of Generalization in Selectivity Learning

Minimax and Communication-Efficient Distributed Best Subset Selection with Oracle Property

Distribution Design for Higher-Order Data Models

Exact Results for the Distribution of Randomly Weighted Sums