Greedy Algorithms for Approximating the Diameter of Machine Learning Datasets in Multidimensional Euclidean Space

Ahmad B. Hassanat

DOI: https://doi.org/10.48550/arXiv.1808.03566

2018-08-10

Abstract: Finding the diameter of a dataset in multidimensional Euclidean space is a well-established problem, with well-known algorithms. However, most of the algorithms found in the literature do not scale well with large values of data dimension, so the time complexity grows exponentially in most cases, which makes these algorithms impractical. Therefore, we implemented 4 simple greedy algorithms to be used for approximating the diameter of a multidimensional dataset; these are based on minimum/maximum l2 norms, hill climbing search, Tabu search and Beam search approaches, respectively. The time complexity of the implemented algorithms is near-linear, as they scale near-linearly with data size and its dimensions. The results of the experiments (conducted on different machine learning data sets) prove the efficiency of the implemented algorithms and can therefore be recommended for finding the diameter to be used by different machine learning applications when needed.

Machine Learning, Data Structures and Algorithms

What problem does this paper attempt to address?

This paper addresses the efficient approximation of the diameter of a dataset in multidimensional Euclidean space, that is, the distance between its two farthest points. It focuses on high-dimensional datasets, where existing algorithms are impractical, especially in online applications, because of their high time complexity (usually quadratic or exponential). The author therefore proposes four simple greedy algorithms that approximate the diameter with near-linear complexity:

1. **Minimum/Maximum L2 Norm**: Calculate the L2 norm of each point, then use the points with the minimum and maximum norms as reference points in the search for the farthest pair.
2. **Hill Climbing**: Start from a random point and repeatedly move to the farthest point until no farther point can be found.
3. **Tabu Search**: Extend Hill Climbing by recording all points tied at the same maximum distance, which improves accuracy.
4. **Beam Search**: Start from multiple random points and share the information about visited points to reduce duplicate calculations and improve efficiency.

The paper evaluates these four algorithms experimentally on multiple machine learning datasets. The results show that they significantly reduce computation time while maintaining high precision, and that they are particularly suitable for high-dimensional datasets.
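The first two strategies can be illustrated with a minimal Python sketch. This is not the author's implementation: the function names are hypothetical, the stopping rule for hill climbing is the simplest one consistent with the description above, and the min/max-norm variant is only one assumed reading of that heuristic. A brute-force routine is included as a reference for comparison.

```python
import math
import random
from itertools import combinations


def exact_diameter(points):
    """Brute-force reference: largest pairwise distance (quadratic in len(points))."""
    return max(math.dist(p, q) for p, q in combinations(points, 2))


def farthest_from(points, origin):
    """Return (point, distance) for the point farthest from `origin` (one linear scan)."""
    best = max(points, key=lambda p: math.dist(origin, p))
    return best, math.dist(origin, best)


def hill_climb_diameter(points, start=None, rng=None):
    """Greedy lower bound on the diameter.

    Starting from `start` (a random point if omitted), repeatedly jump to the
    point farthest from the current one; stop when the distance stops growing.
    """
    rng = rng or random.Random(0)  # fixed seed is an illustrative choice
    current = start if start is not None else rng.choice(points)
    best = 0.0
    while True:
        nxt, d = farthest_from(points, current)
        if d <= best:  # no farther point found: local optimum reached
            return best
        best, current = d, nxt


def min_max_norm_diameter(points):
    """Assumed reading of the min/max-norm heuristic: scan once from the point
    with the smallest L2 norm and once from the point with the largest L2 norm,
    and keep the larger of the two distances found."""
    origin = (0.0,) * len(points[0])
    lo = min(points, key=lambda p: math.dist(origin, p))
    hi = max(points, key=lambda p: math.dist(origin, p))
    return max(farthest_from(points, lo)[1], farthest_from(points, hi)[1])
```

For the toy input `[(0, 0), (3, 4), (1, 1), (-3, -4)]`, `exact_diameter` returns `10.0`, and both `hill_climb_diameter(points, start=(0, 0))` and `min_max_norm_diameter(points)` reach the same value. In general, each heuristic returns the distance between two actual data points, so its result can never exceed the true diameter, but it may fall short of it.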
