Change point detection in high dimensional data with U-statistics

B. Cooper Boniece, Lajos Horváth, Peter Jacobs
DOI: https://doi.org/10.1007/s11749-023-00900-y
2023-12-15
Abstract: We consider the problem of detecting distributional changes in a sequence of high dimensional data. Our approach combines two separate statistics stemming from $L_p$ norms whose behavior is similar under $H_0$ but potentially different under $H_A$, leading to a testing procedure that is flexible against a variety of alternatives. We establish the asymptotic distribution of our proposed test statistics separately in cases of weakly dependent and strongly dependent coordinates as $\min\{N,d\}\to\infty$, where $N$ denotes sample size and $d$ is the dimension, and establish consistency of testing and estimation procedures in high dimensions under one-change alternative settings. Computational studies in single and multiple change point scenarios demonstrate our method can outperform other nonparametric approaches in the literature for certain alternatives in high dimensions. We illustrate our approach through an application to Twitter data concerning the mentions of U.S. Governors.
Statistics Theory, Methodology
What problem does this paper attempt to address?
This paper addresses the problem of detecting distributional changes in a sequence of high-dimensional data. The authors propose a testing procedure that combines two statistics derived from $L_p$ norms. The two statistics behave similarly under the null hypothesis ($H_0$) but may behave differently under the alternative ($H_A$), which makes the combined test sensitive to a variety of alternatives. The authors establish the asymptotic distribution of the test statistics separately for weakly dependent and strongly dependent coordinates as $\min\{N,d\}\to\infty$, and they prove consistency of the testing and estimation procedures in high dimensions when there is a single change. Computational studies with single and multiple change points show that the method can outperform other nonparametric approaches for certain alternatives in high dimensions, and an application to Twitter data on mentions of U.S. Governors illustrates the approach. The method is also flexible in that it requires fewer assumptions about how fast the dimension grows relative to the sample size, which makes it suitable for modern large-scale data applications.
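To make the idea of scanning for a change point with a U-statistic more concrete, here is a minimal Python sketch. It is not the authors' statistic, whose exact form is not given in this summary. Instead it uses an energy-distance-style two-sample U-statistic built from pairwise $L_p$ distances, evaluated at every candidate split. The function names `pairwise_lp` and `scan_change_point`, the weight $k(N-k)/N$, the `min_seg` parameter, and the simulated data are all assumptions chosen for illustration.

```python
import numpy as np


def pairwise_lp(X, p=2.0):
    """Return the N x N matrix of pairwise L_p distances between rows of X."""
    diff = np.abs(X[:, None, :] - X[None, :, :])
    return (diff ** p).sum(axis=2) ** (1.0 / p)


def scan_change_point(X, p=2.0, min_seg=5):
    """Scan candidate splits k and return (k_hat, max_stat).

    Illustrative statistic (an assumption, not the paper's):
        w_k * (2 * mean_between - mean_within_left - mean_within_right)
    with w_k = k * (N - k) / N. The within-segment means are U-statistics
    over distinct pairs, so the zero diagonal is excluded.
    min_seg must be at least 2 so each segment contains a pair.
    """
    N = X.shape[0]
    D = pairwise_lp(X, p)
    best_k, best_stat = None, -np.inf
    for k in range(min_seg, N - min_seg + 1):
        m = N - k
        between = D[:k, k:].mean()
        # Diagonal entries are zero, so the sum covers distinct pairs only.
        within_left = D[:k, :k].sum() / (k * (k - 1))
        within_right = D[k:, k:].sum() / (m * (m - 1))
        stat = (k * m / N) * (2.0 * between - within_left - within_right)
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat


if __name__ == "__main__":
    # Simulated data (hypothetical): N = 100 observations in d = 50
    # dimensions, with a mean shift in every coordinate after index 60.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))
    X[60:] += 1.0
    k_hat, stat = scan_change_point(X)
    print("estimated change point:", k_hat, "statistic:", stat)
```

The sketch only locates the most likely split. Deciding whether a change is significant would need critical values from the limiting distributions established in the paper or from a resampling scheme, neither of which is implemented here.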