Bulk Johnson-Lindenstrauss Lemmas

Michael P. Casey
2023-07-15
Abstract: For a set $X$ of $N$ points in $\mathbb{R}^D$, the Johnson-Lindenstrauss lemma provides random linear maps that approximately preserve all pairwise distances in $X$ -- up to multiplicative error $(1\pm \epsilon)$ with high probability -- using a target dimension of $O(\epsilon^{-2}\log(N))$. Certain known point sets actually require a target dimension this large -- any smaller dimension forces at least one distance to be stretched or compressed too much. What happens to the remaining distances? If we only allow a fraction $\eta$ of the distances to be distorted beyond tolerance $(1\pm \epsilon)$, we show a target dimension of $O(\epsilon^{-2}\log(4e/\eta)\log(N)/R)$ is sufficient for the remaining distances. With the stable rank of a matrix $A$ as $\lVert A\rVert_F^2/\lVert A\rVert^2$, the parameter $R$ is the minimal stable rank over certain $\log(N)$ sized subsets of $X-X$ or their unit normalized versions, involving each point of $X$ exactly once. The linear maps may be taken as random matrices with i.i.d. zero-mean unit-variance sub-gaussian entries. When the data is sampled i.i.d. as a given random vector $\xi$, refined statements are provided; the most improvement happens when $\xi$ or the unit normalized $\widehat{\xi-\xi'}$ is isotropic, with $\xi'$ an independent copy of $\xi$, and includes the case of i.i.d. coordinates.
Probability, Computational Geometry, Information Theory, Metric Geometry, Statistics Theory
What problem does this paper attempt to address?
The paper asks how well a random linear map can preserve pairwise distances when the target dimension is smaller than the Johnson-Lindenstrauss (JL) lemma requires. Specifically, it asks whether at least a $(1 - \eta)$ fraction of the pairwise distances can still be preserved up to multiplicative error $(1\pm\epsilon)$ when the target dimension $k$ is below the dimension $D_{JL}=O(\epsilon^{-2}\log(N^{2}))$ required by the classical JL lemma. Here, $\eta < 1$ is the fraction of distances that are allowed to be distorted beyond that tolerance.

### Background and Motivation

The Johnson-Lindenstrauss lemma provides random linear maps that project a set of $N$ points in a high-dimensional space to a lower-dimensional space while approximately preserving the distances between all pairs of points. A target dimension $k = O(\epsilon^{-2}\log(N))$ suffices to keep every distance within $(1\pm\epsilon)$, where $\epsilon$ is the error tolerance, and certain known point sets actually require a target dimension this large. However, for algorithms whose cost grows exponentially with the dimension (such as nearest-neighbor search), the target dimension $k$ may still be too large after JL preprocessing to be practical. This motivates asking whether most distances can still be approximately preserved with a smaller target dimension.

### Main Contributions

The main contributions of the paper include:

1. **Theoretical Results**:
   - The paper proves that a target dimension $k = O\left(\frac{\epsilon^{-2}\log(4e/\eta)\log(N)}{R}\right)$ is sufficient to preserve a $(1 - \eta)$ fraction of the distances within $(1\pm\epsilon)$. Here, $R$ is the minimal stable rank over certain $\log(N)$-sized subsets of $X - X$ or their unit normalized versions, where the stable rank of a matrix $A$ is defined as $\frac{\|A\|_F^{2}}{\|A\|^{2}}$.
   - For independently and identically distributed (i.i.d.) data, the paper provides more refined results, with the most improvement when the data is isotropic or becomes isotropic after unit normalization of the differences.
2. **Technical Means**:
   - The paper uses the Walecki construction, which decomposes the complete graph $K_N$ into cycles that each pass through all $N$ vertices. This decomposition helps to control the approximate preservation of distances in smaller batches.
   - The paper also uses the concept of stable rank and probabilistic tools such as the Hanson-Wright inequality to analyze the performance of random matrices in dimension reduction.

### Conclusion

The paper shows that suitable random linear maps approximately preserve most pairwise distances at a target dimension smaller than the one the JL lemma requires. This is relevant to the efficiency of high-dimensional data processing, especially at large scale.
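The two quantities the summary relies on, the stable rank $\|A\|_F^{2}/\|A\|^{2}$ and the fraction of distances distorted beyond $(1\pm\epsilon)$, can be probed numerically. The following is a minimal sketch, not the paper's procedure: the function names `stable_rank` and `distorted_fraction`, the synthetic Gaussian data, and the values of $N$, $D$, $k$, and $\epsilon$ are all assumptions chosen for illustration. The map uses i.i.d. standard Gaussian entries scaled by $1/\sqrt{k}$, one instance of the zero-mean unit-variance sub-gaussian matrices mentioned in the abstract.

```python
import numpy as np


def stable_rank(A):
    """Stable rank ||A||_F^2 / ||A||^2 (squared Frobenius over squared spectral norm)."""
    fro_sq = float(np.sum(A * A))
    spec = float(np.linalg.norm(A, 2))  # largest singular value
    return fro_sq / spec**2


def distorted_fraction(X, k, eps, rng):
    """Fraction of pairwise distances in X that a random k x D Gaussian map
    moves outside the (1 - eps, 1 + eps) multiplicative window."""
    n, d = X.shape
    # i.i.d. N(0, 1) entries, scaled so squared norms are preserved in expectation.
    G = rng.standard_normal((k, d)) / np.sqrt(k)
    Y = X @ G.T
    i, j = np.triu_indices(n, k=1)  # each unordered pair once
    d_orig = np.linalg.norm(X[i] - X[j], axis=1)
    d_proj = np.linalg.norm(Y[i] - Y[j], axis=1)
    ratio = d_proj / d_orig
    return float(np.mean((ratio < 1 - eps) | (ratio > 1 + eps)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 200))  # hypothetical data: N = 100 points in R^200
    eps = 0.25                           # illustrative tolerance
    print("stable rank of X:", stable_rank(X))
    for k in (8, 32, 128):
        print("k =", k, "distorted fraction:", distorted_fraction(X, k, eps, rng))
```

Running the script prints the empirical distorted fraction for a few target dimensions. On data like this the fraction typically shrinks as $k$ grows, which is the trade-off between $k$ and $\eta$ that the bound above quantifies.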