Hengrui Luo, Steve N. MacEachern, Mario Peruggia
Abstract: Topological data analysis (TDA) allows us to explore the topological features of a dataset. Among topological features, lower dimensional ones have recently drawn the attention of practitioners in mathematics and statistics due to their potential to aid the discovery of low dimensional structure in a data set. However, lower dimensional features are usually challenging to detect based on finite samples and using TDA methods that ignore the probabilistic mechanism that generates the data. In this paper, lower dimensional topological features occurring as zero-density regions of density functions are introduced and thoroughly investigated. Specifically, we consider sequences of coverings for the support of a density function in which the coverings are comprised of balls with shrinking radii. We show that, when these coverings satisfy certain sufficient conditions as the sample size goes to infinity, we can detect lower dimensional, zero-density regions with increasingly higher probability while guarding against false detection. We supplement the theoretical developments with the discussion of simulated experiments that elucidate the behavior of the methodology for different choices of the tuning parameters that govern the construction of the covering sequences and characterize the asymptotic results.
What problem does this paper attempt to address?
The paper addresses the problem of identifying low-dimensional topological features that arise as zero-density regions of a density function. Such features are hard to detect from a finite sample, because traditional topological data analysis (TDA) techniques rely only on the relative arrangement of points in a finite data set and ignore the probabilistic mechanism that generates the data. Specifically, the paper studies these features through a sequence of coverings of the support of the density function, each made up of balls whose radii shrink as the sample size grows. The main objectives of the paper are:
1. **Detect low-dimensional zero-density regions**: As the sample size tends to infinity, the method detects low-dimensional zero-density regions with increasingly high probability while guarding against false detections.
2. **Combine theory and experiment**: Beyond the theoretical developments, the paper discusses simulation experiments that show how the tuning parameters governing the construction of the covering sequence affect the method's behavior and characterize the asymptotic results.
### Specific problem description
The paper considers independent and identically distributed (i.i.d.) data points drawn from a distribution with a continuous density function \( f \) whose support is \( \text{supp}(f)=M\subset\mathbb{R}^d \). It first states the results formally for the case \( \text{supp}(f) = M = [0,1]^d \) and then extends them to more general cases. A well-behaved version of the density function is used so that the notion of a zero-density region \( S_0\subset\text{supp}(f)\subset M \) is meaningful. Such a region \( S_0 \) is difficult to identify with traditional simplicial complex or density estimation methods.
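To make the setup concrete, the following is a minimal sketch rather than the authors' procedure. It assumes \( d = 2 \), a hypothetical density proportional to \( |x - 0.5|^k \) on \( [0,1]^2 \) (so that \( S_0 \) is the line \( x = 0.5 \)), and a square grid of ball centers. The labels "inner" and "outer" here are an illustrative assumption: a ball is called inner when its center lies within \( \epsilon \) of \( S_0 \). All function names are hypothetical.

```python
import math
import random


def sample_points(n, k=2, seed=0):
    """Draw n i.i.d. points on [0,1]^2 from a density proportional to |x - 0.5|^k."""
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        # Inverse-CDF draw of the distance t in [0, 0.5] from the line x = 0.5,
        # whose density is proportional to t^k.
        t = 0.5 * rng.random() ** (1.0 / (k + 1))
        sign = 1 if rng.random() < 0.5 else -1
        pts.append((0.5 + sign * t, rng.random()))
    return pts


def grid_centers(r):
    """Centers on a square grid with spacing at most r.

    Balls of radius r around these centers cover [0,1]^2.
    """
    m = math.ceil(1.0 / r)
    return [(i / m, j / m) for i in range(m + 1) for j in range(m + 1)]


def empty_ball_counts(points, r, eps):
    """Count empty balls, split by whether the center is within eps of S_0.

    Assumption: S_0 is the line x = 0.5, and "inner" means |cx - 0.5| < eps.
    Returns (number of empty inner balls, number of empty outer balls).
    """
    inner_empty = 0
    outer_empty = 0
    for cx, cy in grid_centers(r):
        is_empty = all((px - cx) ** 2 + (py - cy) ** 2 > r * r for px, py in points)
        if is_empty:
            if abs(cx - 0.5) < eps:
                inner_empty += 1
            else:
                outer_empty += 1
    return inner_empty, outer_empty


if __name__ == "__main__":
    pts = sample_points(500)
    print(empty_ball_counts(pts, r=0.05, eps=0.1))
```

In this toy setting, empty balls near the line \( x = 0.5 \) are the signal of interest, while empty balls far from it would correspond to false detections.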
### Main results
The main result of the paper (Theorem 3.11) shows that, under appropriate conditions, as the sample size \( n \) tends to infinity, the low-dimensional object \( S_0 \) can be detected with probability tending to 1 while avoiding the detection of false holes. The conditions are:
- **Decay rates of the radius and separation distance**:
\[
r(n)\sim O(n^{-\eta}),\quad 0 < \eta < \frac{1}{d},
\]
\[
\epsilon(n)\sim O(n^{-\psi}),\quad 0 < \psi\leq\eta,
\]
\[
2r(n)\leq\epsilon(n)< 1.
\]
- **Lower bound on the density function outside the \(\epsilon\)-neighborhood of \( S_0 \)**:
\[
m(f,n):=\min_{w\in M\setminus B_\epsilon(S_0)} f(w)\in(0,\infty)\sim O(n^{-\xi}),\quad 0 < \xi < 1-\frac{2\eta d}{2}.
\]
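The rate conditions on \( r(n) \) and \( \epsilon(n) \) can be checked numerically for a given sample size. The sketch below assumes the exact forms \( r(n) = c_r n^{-\eta} \) and \( \epsilon(n) = c_\epsilon n^{-\psi} \); the constants `c_r` and `c_eps` and the function names are hypothetical and are not taken from the paper.

```python
def radius(n, eta, c_r=0.5):
    """Ball radius r(n) = c_r * n^(-eta); c_r is an assumed constant."""
    return c_r * n ** (-eta)


def separation(n, psi, c_eps=1.0):
    """Separation distance eps(n) = c_eps * n^(-psi); c_eps is an assumed constant."""
    return c_eps * n ** (-psi)


def rates_admissible(n, d, eta, psi, c_r=0.5, c_eps=1.0):
    """Check 0 < eta < 1/d, 0 < psi <= eta, and 2 r(n) <= eps(n) < 1 at sample size n."""
    if not (0 < eta < 1.0 / d):
        return False
    if not (0 < psi <= eta):
        return False
    r = radius(n, eta, c_r)
    eps = separation(n, psi, c_eps)
    return 2 * r <= eps < 1


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, radius(n, 0.4), separation(n, 0.2), rates_admissible(n, 2, 0.4, 0.2))
```

Because \( \psi \leq \eta \), the separation distance shrinks no faster than the radius, which is what keeps \( 2r(n) \leq \epsilon(n) \) feasible as \( n \) grows.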
### Theorem conclusions
- **(A)** If \(\eta\) and \(\psi\) satisfy:
\[
1 - 2\eta d- 2K_f\psi> 0,
\]
then:
\[
\lim_{n\rightarrow\infty}P(\text{no empty }\epsilon(n)\text{-outer spheres}) = 1.
\]
- **(B)** If \(\eta\) satisfies:
\[
1 + d_0\eta- K_f\eta- d\eta < 0,
\]
then:
\[
\lim_{n\rightarrow\infty}P(\text{all }\epsilon(n)\text{-inner spheres are empty}) = 1.
\]
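The two inequalities in (A) and (B) are simple to evaluate for candidate exponents. The sketch below treats \( K_f \) and \( d_0 \) as plain numeric inputs, since this summary does not define them; the function names are hypothetical.

```python
def condition_a(eta, psi, d, k_f):
    """Condition (A): 1 - 2*eta*d - 2*K_f*psi > 0."""
    return 1 - 2 * eta * d - 2 * k_f * psi > 0


def condition_b(eta, d, d_0, k_f):
    """Condition (B): 1 + d_0*eta - K_f*eta - d*eta < 0."""
    return 1 + d_0 * eta - k_f * eta - d * eta < 0


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    eta, psi, d, d_0, k_f = 0.2, 0.01, 2, 0, 4
    print("A holds:", condition_a(eta, psi, d, k_f))
    print("B holds:", condition_b(eta, d, d_0, k_f))
```

Condition (A) favors small \( \eta \) and \( \psi \), whereas condition (B) requires \( \eta (K_f + d - d_0) > 1 \), so the two pull the exponents in opposite directions and both must be checked for any candidate pair.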
### Corollary
The paper also provides results for the case where the zero-density region \( S_0 \) consists of finitely many connected components (Corollary 3.12).