Multidimensional scaling for big data

Pedro Delicado, Cristian Pachón-García
DOI: https://doi.org/10.1007/s11634-024-00591-9
2024-04-14
Advances in Data Analysis and Classification
Abstract: We present a set of algorithms implementing multidimensional scaling (MDS) for large data sets. MDS is a family of dimensionality reduction techniques using an \(n \times n\) distance matrix as input, where \(n\) is the number of individuals, and producing a low-dimensional configuration: an \(n \times r\) matrix with \(r \ll n\). When \(n\) is large, MDS is unaffordable with classical MDS algorithms because of their extremely large memory and time requirements. We compare six non-standard algorithms intended to overcome these difficulties. They are based on the central idea of partitioning the data set into small pieces, where classical MDS methods can work. Two of these algorithms are original proposals. To check the performance of the algorithms and to compare them, we have carried out a simulation study. Additionally, we have used the algorithms to obtain an MDS configuration for EMNIST: a real large data set with more than 800000 points. We conclude that all the algorithms are appropriate for obtaining an MDS configuration, but we recommend using one of our proposals, since it is a fast algorithm with satisfactory statistical properties when working with big data. An R package implementing the algorithms has been created.
statistics & probability
What problem does this paper attempt to address?
The paper addresses the computational limits of applying multidimensional scaling (MDS) to large-scale data sets. Classical MDS becomes infeasible at that scale because of its memory and time requirements: the algorithm runs in O(n^3) time, where n is the number of data points, and it must store and process an n × n distance matrix, which alone takes O(n^2) memory. To get around this, the paper studies six non-standard MDS algorithms, two of which are original proposals. All of them rely on the same idea: split the data set into pieces small enough for classical MDS to handle, then assemble the partial results. They include:

1. **Interpolation MDS**: A small sample is randomly selected from the large-scale data set and analysed with classical MDS; the remaining data points are then projected into the low-dimensional configuration using Gower's interpolation formula.
2. **Divide-and-Conquer MDS**: The large-scale data set is divided into several small parts, MDS is carried out on each part separately, and the partial configurations are combined through Procrustes transformations.
3. **Fast MDS**: The data set is divided recursively, splitting further whenever necessary, until each part is small enough for MDS to be applied directly; the partial configurations are then combined through Procrustes transformations.
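As a rough illustration of the interpolation idea in item 1, the following Python/NumPy sketch runs classical MDS on a random sample and then places the remaining points with Gower's interpolation formula. It is not the `bigmds` implementation: the function names (`pairwise_dist`, `classical_mds`, `interpolation_mds`) and the parameters (`sample_size`, `seed`) are hypothetical, distances are assumed to be Euclidean, and the remaining points are handled in a single pass, which a memory-conscious version would likely split into batches.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between the rows of a (m x p) and b (n x p)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def classical_mds(dist, r):
    """Classical MDS on an n x n distance matrix.

    Returns the n x r configuration, the r largest eigenvalues and the
    diagonal of the doubly centred matrix B. Assumes those r eigenvalues
    are strictly positive.
    """
    n = dist.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    b_mat = -0.5 * centring @ (dist ** 2) @ centring
    eigval, eigvec = np.linalg.eigh(b_mat)      # eigenvalues in ascending order
    top = np.argsort(eigval)[::-1][:r]          # indices of the r largest
    lam = eigval[top]
    config = eigvec[:, top] * np.sqrt(lam)      # scale each eigenvector column
    return config, lam, np.diag(b_mat)


def interpolation_mds(data, r, sample_size, seed=0):
    """Hypothetical sketch: MDS on a random sample, Gower interpolation for the rest."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    idx = rng.permutation(n)
    sample, rest = idx[:sample_size], idx[sample_size:]

    # Step 1: classical MDS on the sample only (sample_size x sample_size matrix).
    x_s, lam, b_diag = classical_mds(pairwise_dist(data[sample], data[sample]), r)

    out = np.empty((n, r))
    out[sample] = x_s

    # Step 2: Gower's formula, y = 0.5 * Lambda^{-1} X^T (b - d^2), applied
    # row-wise to the squared distances from each remaining point to the sample.
    d2 = pairwise_dist(data[rest], data[sample]) ** 2
    out[rest] = 0.5 * (b_diag - d2) @ x_s / lam
    return out
```

When the data truly lie in `r` dimensions, the configuration returned by `interpolation_mds` reproduces all pairwise distances up to rounding error, which makes a convenient sanity check. Only a `sample_size` × `sample_size` matrix is ever eigendecomposed, which is where the savings over classical MDS come from.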
Through a simulation study and an application to a real large-scale data set (EMNIST), the paper compares these algorithms with the other methods in the study (such as Landmark MDS and Pivot MDS), evaluating how well each one recovers the dimensions of the data and how fast it runs. The results indicate that all the algorithms are suitable for obtaining an MDS configuration of a large-scale data set, but the authors recommend one of their own proposals because it is fast and has satisfactory statistical properties on big data. The authors have also developed an R package, `bigmds`, that implements these algorithms for researchers and practitioners.
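The divide-and-conquer strategy described above can be sketched in a few lines of Python/NumPy. This is an illustration under stated assumptions, not the authors' code or the `bigmds` API: the names (`divide_conquer_mds`, `procrustes_fit`, `part_size`, `n_connect`) are hypothetical, distances are assumed Euclidean, and the stitching scheme is this sketch's own choice. Each partition is embedded together with a few connecting points taken from the first partition, and an orthogonal Procrustes fit on those shared points aligns it with the first configuration.

```python
import numpy as np


def pairwise_dist(a):
    """Euclidean distance matrix between the rows of a."""
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def classical_mds(dist, r):
    """Classical MDS: n x r configuration from an n x n distance matrix."""
    n = dist.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    b_mat = -0.5 * centring @ (dist ** 2) @ centring
    eigval, eigvec = np.linalg.eigh(b_mat)      # eigenvalues in ascending order
    top = np.argsort(eigval)[::-1][:r]
    # Clip tiny negative eigenvalues caused by rounding before the square root.
    return eigvec[:, top] * np.sqrt(np.maximum(eigval[top], 0.0))


def procrustes_fit(source, target):
    """Orthogonal Procrustes (rotation/reflection plus translation, no scaling).

    Returns (rot, mu_s, mu_t) such that (source - mu_s) @ rot + mu_t is the
    least-squares match to target.
    """
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    u, _, vt = np.linalg.svd((source - mu_s).T @ (target - mu_t))
    return u @ vt, mu_s, mu_t


def divide_conquer_mds(data, r, part_size, n_connect, seed=0):
    """Hypothetical sketch: embed partitions separately, align them via shared points."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    idx = rng.permutation(n)
    parts = [idx[i:i + part_size] for i in range(0, n, part_size)]

    out = np.empty((n, r))
    first = parts[0]
    out[first] = classical_mds(pairwise_dist(data[first]), r)
    connect = first[:n_connect]                 # shared by every later partition

    for part in parts[1:]:
        joint = np.concatenate([connect, part])
        conf = classical_mds(pairwise_dist(data[joint]), r)
        # Align the connecting points of this embedding with their known positions.
        rot, mu_s, mu_t = procrustes_fit(conf[:n_connect], out[connect])
        out[part] = (conf[n_connect:] - mu_s) @ rot + mu_t
    return out
```

The largest matrix ever decomposed has `part_size + n_connect` rows, so memory stays bounded regardless of `n`. The number of connecting points should exceed `r`, otherwise the Procrustes fit is underdetermined.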