Application of the representative measure approach to assess the reliability of decision trees in dealing with unseen vehicle collision data

Javier Perera-Lago, Víctor Toscano-Durán, Eduardo Paluzo-Hidalgo, Sara Narteni, Matteo Rucco
2024-04-15
Abstract: Machine learning algorithms are fundamental components of novel data-informed Artificial Intelligence architectures. In this domain, representative datasets are a cornerstone in shaping the trajectory of artificial intelligence (AI) development. Representative datasets are needed to train machine learning components properly. Proper training has multiple impacts: it reduces the final model's complexity, power, and uncertainties. In this paper, we investigate the reliability of the $\varepsilon$-representativeness method to assess the dataset similarity from a theoretical perspective for decision trees. We decided to focus on the family of decision trees because it includes a wide variety of models known to be explainable. Thus, in this paper, we provide a result guaranteeing that if two datasets are related by $\varepsilon$-representativeness, i.e., both of them have points closer than $\varepsilon$, then the predictions by the classic decision tree are similar. Experimentally, we have also tested that $\varepsilon$-representativeness presents a significant correlation with the ordering of the feature importance. Moreover, we extend the results experimentally in the context of unseen vehicle collision data for XGBoost, a machine-learning component widely adopted for dealing with tabular data.
Machine Learning
What problem does this paper attempt to address?
This paper examines how reliably decision trees handle unseen vehicle collision data, using dataset similarity as the lens. The authors analyze the ε-representativeness method theoretically and validate it experimentally. They focus on the decision tree family because its models are explainable and because tree-based components such as XGBoost are widely adopted for tabular data.

The paper first stresses the importance of representative datasets in the development of artificial intelligence, since they determine how well machine learning components are trained. It then studies ε-representativeness as a measure of dataset similarity and proves that if two datasets are related by ε-representativeness, i.e., both of them have points closer than ε to one another, the predictions of classic decision trees are similar.

Experimentally, ε-representativeness shows a significant correlation with the ordering of feature importance, and the authors extend this finding to XGBoost on vehicle collision data. The results indicate that subsets that are ε-representative of the original data for a small ε tend to produce similar feature importance rankings, which suggests that the models interpret the data in similar ways and behave similarly at prediction time. The paper concludes that ε-representativeness can help preserve model accuracy when a representative dataset is used in place of the original one. Future work will focus on theoretical guarantees for comparing feature importance rankings and on distance-based decision rules.
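As a rough illustration of the similarity measure discussed above, the following Python sketch computes the smallest ε for which a subset covers a dataset. It rests on a simplifying assumption that is not taken from the paper: a subset is treated as ε-representative when every point of the full dataset has some subset point within Euclidean distance ε, with class labels ignored. The paper's formal definition may differ. The function names `min_epsilon` and `is_eps_representative` and the toy points are hypothetical.

```python
import math


def min_epsilon(dataset, subset):
    """Smallest eps such that every point of `dataset` lies within
    Euclidean distance eps of at least one point of `subset`.

    Assumption: labels are ignored and the distance is Euclidean.
    """
    return max(min(math.dist(x, s) for s in subset) for x in dataset)


def is_eps_representative(dataset, subset, eps):
    """Check the (assumed) covering condition for a given eps."""
    return min_epsilon(dataset, subset) <= eps


# Toy 2D data: three points clustered near the origin and one far away.
full = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (4.0, 4.0)]
subset = [(0.0, 0.0), (4.0, 4.0)]

eps = min_epsilon(full, subset)
print(f"smallest eps: {eps}")
print("representative at eps=0.5:", is_eps_representative(full, subset, 0.5))
```

A smaller value returned by `min_epsilon` means the subset follows the full dataset more closely. Under the reading above, that is the regime in which tree predictions and feature importance orderings would be expected to stay similar.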