Time Series Clustering Using DBSCAN

Nicholas Waltz
2024-03-22
Abstract: Economic policy and research rely on the correct evaluation of the billions of high-frequency data points that we collect every day. Consistent clustering algorithms, like DBSCAN, allow us to make sense of the data in a useful way. However, while there is a large literature on the consistency of various clustering algorithms for high-dimensional static clustering, the literature on multivariate time series clustering still largely relies on heuristics or restrictive assumptions. The aim of this paper is to prove a notion of consistency of DBSCAN for the task of clustering multivariate time series.
Statistics Theory
What problem does this paper attempt to address?
This paper addresses the consistency of the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm when it is used to cluster time series. The motivation is that economic policy and research depend on correctly evaluating large volumes of high-frequency data, and consistent clustering algorithms help make sense of such data. Although there is a large literature on density-based clustering of static data, work on multivariate time series clustering still relies mainly on heuristics or restrictive assumptions. The goal of the paper is to prove a notion of consistency of DBSCAN for clustering multivariate time series.
The author points out that time series clustering is widely applied in finance, economics, and other fields, for example to group stock data or classify diseases. However, current methods either lack statistical rigor or rely on subjective data selection, which can introduce bias and spurious regression results. DBSCAN is a non-parametric clustering algorithm that does not require the number of clusters to be specified in advance, but applying it to time series raises challenges, such as the lack of a clear criterion for comparing series and the need for data preprocessing. The paper argues that, under appropriate assumptions, results from the static setting carry over to time series, and it discusses how to handle incomplete data, noise, and phase issues.
The main contributions of the paper are: 1) proving that the multivariate problem can be "flattened," so that concentration bounds for static data also apply to time series; 2) establishing consistent estimation of the functional form in the presence of noise and incomplete data; 3) discussing dimensionality issues. The paper also reviews the relevant literature, in particular consistency results for multivariate functional data analysis and for density-based clustering of static data.
By demonstrating the clustering consistency of DBSCAN when dealing with properly preprocessed data, the paper provides a theoretical foundation for time series clustering, especially for multivariate functional data in economic research.
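To make the "flattening" idea in the summary concrete, the following is a minimal sketch, not the paper's construction. It assumes that each multivariate series is observed on the same time grid, that flattening means concatenating all time steps into one vector, and that Euclidean distance on the flattened vectors is an acceptable comparison criterion. The helper names (`flatten`, `make_series`, `dbscan`), the toy data, and the parameter values `eps` and `min_pts` are all hypothetical, and the preprocessing for noise, incomplete data, and phase that the paper discusses is omitted.

```python
import math

def flatten(series):
    """Flatten a series given as T time steps of d values into one vector of length T*d."""
    return [value for step in series for value in step]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dbscan(points, eps, min_pts):
    """Textbook DBSCAN. Returns one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    # eps-neighbourhood of every point (a point counts as its own neighbour).
    neighbours = [
        [j for j in range(n) if euclidean(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1  # point i is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # core point: keep expanding the cluster
    return labels

def make_series(level, offset):
    """Hypothetical toy series: 4 time steps, 2 variables per step."""
    return [[level + offset, level - offset] for _ in range(4)]

# Two groups of similar series plus one series far away from both.
series = [
    make_series(0.0, 0.0), make_series(0.0, 0.1), make_series(0.0, 0.2),
    make_series(5.0, 0.0), make_series(5.0, 0.1), make_series(5.0, 0.2),
    make_series(20.0, 0.0),
]
flat = [flatten(s) for s in series]
labels = dbscan(flat, eps=1.0, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The number of clusters is never passed in: two clusters and one noise point emerge from `eps` and `min_pts` alone, which is the non-parametric property the summary refers to. In practice these two parameters, and the distance used on the flattened vectors, would have to be chosen to suit the data.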