Analysis of UMAP, the method for reducing the dimensionality of initial data in machine learning for the purpose of failure prediction in a motive power service

O. B. Pronevich, A. P. Klokova
DOI: https://doi.org/10.21683/1729-2646-2022-22-4-53-62
2022-11-22
Dependability
Abstract: Aim. Feature transformation is a stage of machine learning application that significantly affects the quality of regression models. The paper aims to develop criteria for evaluating the quality of data dimensionality reduction at the feature transformation stage and to adapt the UMAP method to the problem of predicting the number of days to failure in the locomotives of JSC RZD. Methods. Data transformation methods are divided into two groups: those that attempt to preserve the global data structure and those that attempt to preserve the distances between points. The paper examines in detail UMAP, a nonlinear dimensionality reduction method whose low-dimensional data representation is based on a transformation of a nearest neighbour graph that retains the data structure. The structure of the initial data manifold is examined using topological data analysis and simplified fuzzy set construction methods. Results. The analysis of UMAP theory, conducted in the Russian language for the first time, enabled a substantiated identification of the three primary parameters of the method whose variation significantly affects the form of the data obtained as a result of the transformation, in particular the quality of class separation in a two-dimensional space. Additionally, the characteristics of the input set of parameters that affect the UMAP results were identified. Practical results of UMAP application were demonstrated. Intermediate results included a list of nearest neighbours and a weighted graph of nearest neighbours. The fundamental result is a low-dimensional data representation (reduced from 44 initial measurements) in a two-dimensional space with class separation, which is confirmed both by calculations and visually. Conclusions. It was identified that UMAP is an efficient and substantiated method of dimensionality reduction that allows, through parameter variation, transforming data in such a way as to improve the quality of the data submitted to machine learning models by the criterion of "evident class separation". The transformation is an intermediate stage of data preparation for regression model application, and class separation was performed to eliminate the possibility of gross regression errors.
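The two intermediate results named in the abstract, a list of nearest neighbours and a weighted nearest neighbour graph, can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the locomotive data: the points are synthetic, and the neighbour count `k`, the helper names, and the Euclidean metric are assumptions made for illustration. The edge weights follow the standard UMAP construction (exponential decay beyond the nearest-neighbour distance, then a fuzzy-union symmetrisation); the subsequent low-dimensional layout optimisation is omitted.

```python
import math
import random


def knn_list(points, k):
    """For each point, return its k nearest neighbours as (index, distance) pairs."""
    result = []
    for i, p in enumerate(points):
        dists = [(j, math.dist(p, q)) for j, q in enumerate(points) if j != i]
        dists.sort(key=lambda pair: pair[1])
        result.append(dists[:k])
    return result


def smooth_sigma(dists, rho, k, n_iter=64):
    """Bisection for sigma such that sum(exp(-(d - rho) / sigma)) equals log2(k)."""
    target = math.log2(k)
    lo, hi = 1e-12, 1e6
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        total = sum(math.exp(-max(d - rho, 0.0) / mid) for d in dists)
        if total > target:
            hi = mid  # weights too large: shrink sigma
        else:
            lo = mid  # weights too small: grow sigma
    return (lo + hi) / 2.0


def fuzzy_graph(points, k):
    """Return the neighbour list and a symmetric weighted graph {(i, j): weight}."""
    neighbours = knn_list(points, k)
    directed = {}
    for i, nbrs in enumerate(neighbours):
        dists = [d for _, d in nbrs]
        rho = dists[0]  # distance to the nearest neighbour (simplified choice)
        sigma = smooth_sigma(dists, rho, k)
        for j, d in nbrs:
            directed[(i, j)] = math.exp(-max(d - rho, 0.0) / sigma)
    graph = {}
    for (i, j), a in directed.items():
        b = directed.get((j, i), 0.0)
        # Fuzzy union of the two directed memberships.
        graph[(min(i, j), max(i, j))] = a + b - a * b
    return neighbours, graph


# Synthetic stand-in data: 30 points with 44 features each (assumed sizes).
random.seed(0)
points = [[random.gauss(0.0, 1.0) for _ in range(44)] for _ in range(30)]
neighbours, graph = fuzzy_graph(points, k=5)

for edge, weight in sorted(graph.items())[:5]:
    print(edge, round(weight, 3))
```

In this sketch a larger `k` connects each point to more distant neighbours, which tends to emphasise global structure over local detail. Treating this as one of the three primary parameters discussed in the paper is an assumption, since the abstract does not name them.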