Topology-Preserving Scaling in Data Augmentation

Vu-Anh Le,Mehmet Dik
2024-11-29
Abstract:We propose an algorithmic framework for dataset normalization in data augmentation pipelines that preserves topological stability under non-uniform scaling transformations. Given a finite metric space \( X \subset \mathbb{R}^n \) with Euclidean distance \( d_X \), we consider scaling transformations defined by scaling factors \( s_1, s_2, \ldots, s_n > 0 \). Specifically, we define a scaling function \( S \) that maps each point \( x = (x_1, x_2, \ldots, x_n) \in X \) to \[ S(x) = (s_1 x_1, s_2 x_2, \ldots, s_n x_n). \] Our main result establishes that the bottleneck distance \( d_B(D, D_S) \) between the persistence diagrams \( D \) of \( X \) and \( D_S \) of \( S(X) \) satisfies: \[ d_B(D, D_S) \leq (s_{\max} - s_{\min}) \cdot \operatorname{diam}(X), \] where \( s_{\min} = \min_{1 \leq i \leq n} s_i \), \( s_{\max} = \max_{1 \leq i \leq n} s_i \), and \( \operatorname{diam}(X) \) is the diameter of \( X \). Based on this theoretical guarantee, we formulate an optimization problem to minimize the scaling variability \( \Delta_s = s_{\max} - s_{\min} \) under the constraint \( d_B(D, D_S) \leq \epsilon \), where \( \epsilon > 0 \) is a user-defined tolerance. We develop an algorithmic solution to this problem, ensuring that data augmentation via scaling transformations preserves essential topological features. We further extend our analysis to higher-dimensional homological features, alternative metrics such as the Wasserstein distance, and iterative or probabilistic scaling scenarios. Our contributions provide a rigorous mathematical framework for dataset normalization in data augmentation pipelines, ensuring that essential topological characteristics are maintained despite scaling transformations.
Algebraic Topology,Information Theory,Machine Learning
What problem does this paper attempt to address?