Greedy Algorithms for Approximating the Diameter of Machine Learning Datasets in Multidimensional Euclidean Space

Ahmad B. Hassanat
DOI: https://doi.org/10.48550/arXiv.1808.03566
2018-08-10
Abstract: Finding the diameter of a dataset in multidimensional Euclidean space is a well-established problem, with well-known algorithms. However, most of the algorithms found in the literature do not scale well with large values of data dimension, so the time complexity grows exponentially in most cases, which makes these algorithms impractical. Therefore, we implemented 4 simple greedy algorithms to be used for approximating the diameter of a multidimensional dataset; these are based on minimum/maximum l2 norms, hill climbing search, Tabu search and Beam search approaches, respectively. The time complexity of the implemented algorithms is near-linear, as they scale near-linearly with data size and its dimensions. The results of the experiments (conducted on different machine learning data sets) prove the efficiency of the implemented algorithms and can therefore be recommended for finding the diameter to be used by different machine learning applications when needed.
Machine Learning, Data Structures and Algorithms
What problem does this paper attempt to address?
This paper addresses the efficient approximation of the diameter of a dataset in multidimensional Euclidean space, that is, the largest distance between any two of its points. It focuses on high-dimensional datasets, where existing algorithms are impractical, especially in online applications, because their time complexity is usually quadratic or exponential. The author therefore proposes four simple greedy algorithms that approximate the diameter in near-linear time:

1. **Minimum/Maximum L2 Norm**: Calculate the L2 norm of each point, then compare the points with the minimum and maximum norms to find the farthest pair of points.
2. **Hill Climbing**: Start from a random point and repeatedly move to the farthest point until no farther point can be found.
3. **Tabu Search**: Build on the Hill Climbing method, but record all points that share the same maximum distance to increase the accuracy of the algorithm.
4. **Beam Search**: Start from multiple random points and share information about visited points to reduce duplicate calculations and improve efficiency.

The paper evaluates the four algorithms experimentally on multiple machine learning datasets. The results show that they significantly reduce computation time while maintaining high precision, and that they are particularly suitable for high-dimensional datasets.
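To make the hill climbing strategy described above more concrete, here is a minimal Python sketch. It is an illustration written for this summary, not the author's implementation: the function names (`farthest_from`, `hill_climb_diameter`, `brute_force_diameter`), the `start` and `seed` parameters, and the stopping rule (stop once the farthest distance no longer increases) are all assumptions. The quadratic brute-force function is included only as a reference for checking the estimate on small inputs.

```python
import math
import random


def farthest_from(points, i):
    """Return (index, distance) of the point farthest from points[i]."""
    best_j, best_d = i, 0.0
    for j, q in enumerate(points):
        d = math.dist(points[i], q)  # Euclidean (L2) distance
        if d > best_d:
            best_j, best_d = j, d
    return best_j, best_d


def hill_climb_diameter(points, start=None, seed=0):
    """Estimate the diameter by repeatedly jumping to the farthest point.

    Hypothetical sketch: each pass is one linear scan over the data, and
    the loop stops when a pass fails to find a strictly larger distance.
    The result is a lower bound on the true diameter.
    """
    if start is None:
        start = random.Random(seed).randrange(len(points))
    current, best = start, 0.0
    while True:
        nxt, d = farthest_from(points, current)
        if d <= best:  # no farther point found: local optimum
            return best
        current, best = nxt, d


def brute_force_diameter(points):
    """Exact diameter via all pairs; quadratic, for checking only."""
    return max(
        (math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]),
        default=0.0,
    )


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
    print(hill_climb_diameter(pts, start=1), brute_force_diameter(pts))
```

Because the best distance strictly increases on every pass, the loop always terminates. The estimate can never exceed the true diameter, but it may fall short of it when the search stops at a local optimum, which is the gap the Tabu and Beam search variants are meant to narrow.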