Fréchet random forests for metric space valued regression with non euclidean predictors

Louis Capitaine, Jérémie Bigot, Rodolphe Thiébaut, Robin Genuer
2024-02-16
Abstract: Random forests are a statistical learning method widely used in many areas of scientific research because of its ability to learn complex relationships between input and output variables and also its capacity to handle high-dimensional data. However, current random forest approaches are not flexible enough to handle heterogeneous data such as curves, images and shapes. In this paper, we introduce Fréchet trees and Fréchet random forests, which can handle data for which input and output variables take values in general metric spaces. To this end, a new way of splitting the nodes of trees is introduced and the prediction procedures of trees and forests are generalized. Then, the random forest out-of-bag error and variable importance score are naturally adapted. A consistency theorem for the Fréchet regressogram predictor using data-driven partitions is given and applied to Fréchet purely uniformly random trees. The method is studied through several simulation scenarios on heterogeneous data combining longitudinal, image and scalar data. Finally, one real dataset about air quality is used to illustrate the use of the proposed method in practice.
Machine Learning, Methodology
What problem does this paper attempt to address?
The paper addresses regression when the predictors, the response, or both are non-Euclidean objects such as curves, images and shapes. Standard random forests are limited with such heterogeneous data because they usually accept only numerical or categorical input variables. To overcome this limitation, the authors introduce Fréchet trees and Fréchet random forests, which handle input and output variables that take values in general metric spaces.

### Specific problem description

1. **Handling non-Euclidean data**:
   - Standard random forests cannot effectively handle non-Euclidean data such as curves, images and shapes. Such data are common in practice, for example time-series data in air quality research.
2. **Generalizing splitting rules**:
   - To work with data in metric spaces, the way nodes are split has to be redefined. Standard splitting rules threshold a numerical variable or partition the levels of a categorical one, whereas a split in a metric space can rely only on distances between elements.
3. **Generalizing prediction methods**:
   - A general metric space has no addition or scalar multiplication, so the usual mean and variance are not defined. The Fréchet mean and Fréchet variance are used instead to generalize the prediction procedures.
4. **Maintaining the flexibility and accuracy of the model**:
   - The new method should handle complex data types while keeping the predictive performance of random forests and their ability to handle high-dimensional data.

### Solution

The authors propose the following:

- **Introducing Fréchet trees and Fréchet random forests**:
  - Fréchet trees handle data in metric spaces through a new splitting rule based on a Voronoi partition: each input element is assigned to the nearer of two center points.
- **Using Fréchet mean and Fréchet variance**:
  - The Fréchet mean of the outputs in a node is used as the predicted value, and the Fréchet variance is used to measure the quality of a split. This lets the model work with any data type for which a distance is available.
- **Consistency theorem**:
  - A consistency theorem is given for the Fréchet regressogram predictor using data-driven partitions and is applied to Fréchet purely uniformly random trees, which gives the approach theoretical support.
- **Simulation experiments and practical applications**:
  - The method is studied in several simulation scenarios on heterogeneous data combining longitudinal, image and scalar data, and is illustrated on a real air quality dataset.

In conclusion, the paper addresses the limitations of standard random forests on non-Euclidean data by introducing Fréchet trees and Fréchet random forests, which extend random forests to inputs and outputs in general metric spaces.
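To make the splitting idea concrete, here is a minimal Python sketch of a single node split that works only from precomputed distance matrices. It is an illustration under stated assumptions, not the authors' implementation: the function names (`frechet_mean_index`, `frechet_variance`, `voronoi_split`, `best_split`) are hypothetical, the Fréchet mean is approximated by the medoid (the sample point minimizing the sum of squared distances to the others), and the two centers are found by an exhaustive search over pairs of sample points, which is a simplification chosen for readability.

```python
import numpy as np


def frechet_mean_index(dist, idx):
    """Medoid approximation of the Fréchet mean (an assumption of this sketch):
    the sample in `idx` minimizing the sum of squared distances to the others."""
    sub = dist[np.ix_(idx, idx)]
    return idx[np.argmin((sub ** 2).sum(axis=1))]


def frechet_variance(dist, idx):
    """Mean squared distance of the samples in `idx` to their medoid."""
    if len(idx) == 0:
        return 0.0
    m = frechet_mean_index(dist, idx)
    return float(np.mean(dist[m, idx] ** 2))


def voronoi_split(dist_x, idx, c_left, c_right):
    """Voronoi partition: send each sample to the nearer of two centers."""
    go_left = dist_x[idx, c_left] <= dist_x[idx, c_right]
    return idx[go_left], idx[~go_left]


def best_split(dist_x, dist_y, idx):
    """Try every pair of samples as centers (simplification) and keep the
    split with the lowest size-weighted Fréchet variance of the outputs."""
    best = None
    n = len(idx)
    for a in range(n):
        for b in range(a + 1, n):
            left, right = voronoi_split(dist_x, idx, idx[a], idx[b])
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * frechet_variance(dist_y, left)
                     + len(right) * frechet_variance(dist_y, right)) / n
            if best is None or score < best[0]:
                best = (score, idx[a], idx[b], left, right)
    return best


# Toy data: scalars with the absolute difference as metric. Any metric works
# as long as the pairwise distance matrices can be computed.
x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
y = np.array([0.0, 0.0, 0.0, 5.0, 5.0, 5.0])
dist_x = np.abs(x[:, None] - x[None, :])
dist_y = np.abs(y[:, None] - y[None, :])

score, c_left, c_right, left, right = best_split(dist_x, dist_y, np.arange(6))
print(score, left, right)
```

Because the functions only read distance matrices, the same sketch would apply to curves or images once a suitable distance between them has been chosen; only the construction of `dist_x` and `dist_y` changes.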