Abstract: Machine learning algorithms are fundamental components of novel data-informed artificial intelligence (AI) architectures. In this domain, representative datasets play a cornerstone role in shaping the trajectory of AI development, because they are needed to train machine learning components properly. Proper training has multiple impacts: it reduces the final model's complexity, power, and uncertainties. In this paper, we investigate, from a theoretical perspective, the reliability of the $\varepsilon$-representativeness method for assessing dataset similarity for decision trees. We focus on the family of decision trees because it includes a wide variety of models known to be explainable. We provide a result guaranteeing that if two datasets are related by $\varepsilon$-representativeness, i.e., both of them have points closer than $\varepsilon$, then the predictions by the classic decision tree are similar. Experimentally, we have also tested that $\varepsilon$-representativeness presents a significant correlation with the ordering of the feature importance. Moreover, we extend the results experimentally in the context of unseen vehicle collision data for XGBoost, a machine-learning component widely adopted for dealing with tabular data.

What problem does this paper attempt to address?

This paper examines how reliably the ε-representativeness method assesses dataset similarity for decision trees, with an application to unseen vehicle collision data. The authors focus on the decision tree family because it includes a wide variety of explainable models. The paper first emphasizes the importance of representative datasets in AI development, since they are needed to train machine learning components properly.

The authors then study ε-representativeness as a measure of dataset similarity and prove that if two datasets are related by ε-representativeness, i.e., their points lie closer than ε to one another, the predictions of a classic decision tree are similar. Experimentally, ε-representativeness shows a significant correlation with the ordering of feature importance, and the authors extend these experiments to XGBoost, a machine learning component widely used for tabular data, on vehicle collision data. In the reported experiments, data subsets that are closely ε-representative of the full dataset produce similar feature importance rankings and similar prediction performance.

The paper concludes that ε-representativeness is a useful indicator of model reliability: for decision trees and XGBoost, when one dataset is closely ε-representative of another, the order of feature importance is similar, which suggests that the model's interpretation of the data is also similar. Future work will focus on providing theoretical guarantees for comparing feature importance rankings and distance-based decision rules.
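To make the ε-representativeness relation concrete, here is a minimal Python sketch. It rests on assumptions that the abstract does not confirm: distances are Euclidean, and a subset counts as ε-representative of a reference dataset when every reference point has some subset point closer than ε. The function names `covering_radius` and `is_epsilon_representative` are hypothetical and are not taken from the paper.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def covering_radius(reference, subset):
    """Largest distance from any reference point to its nearest subset point.

    Assumption: Euclidean distance; the paper may use a different metric.
    """
    return max(min(dist(p, q) for q in subset) for p in reference)


def is_epsilon_representative(reference, subset, eps):
    """True if every reference point has a subset point closer than eps.

    Assumption: this one-directional reading of the abstract's
    "points closer than epsilon" is illustrative only.
    """
    return covering_radius(reference, subset) < eps


# Toy data: the four corners of the unit square and a two-point subset.
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
subset = [(0.0, 0.0), (1.0, 1.0)]

print(covering_radius(reference, subset))
print(is_epsilon_representative(reference, subset, eps=1.5))
print(is_epsilon_representative(reference, subset, eps=0.5))
```

Under these assumptions, a smaller covering radius means the subset tracks the reference dataset more closely, which is the situation in which the paper's result says decision tree predictions should be similar.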

Application of the representative measure approach to assess the reliability of decision trees in dealing with unseen vehicle collision data

Handling Missing Data in Decision Trees: A Probabilistic Approach

Building more accurate decision trees with the additive tree

Dive into Decision Trees and Forests: A Theoretical Demonstration

Enhancing Autonomous Vehicle Decision-Making at Intersections in Mixed-Autonomy Traffic: A Comparative Study Using an Explainable Classifier

Predicting and Analysing Road Accident Severity with Machine Learning Models and Resampling

Parameter Estimation in Semi-Random Decision Tree Ensembling on Streaming Data

Importance measures derived from random forests: characterisation and extension

Learning accurate and interpretable decision trees

Comparing Resampling Algorithms and Classifiers for Modeling Traffic Risk Prediction

Decision Tree Learning for Uncertain Clinical Measurements

Correlation and Unintended Biases on Univariate and Multivariate Decision Trees

Quality Diversity Evolutionary Learning of Decision Trees

Non-uniform feature sampling for decision tree ensembles

Learning decision trees through Monte Carlo tree search: An empirical evaluation

The Power of Unbiased Recursive Partitioning: A Unifying View of CTree, MOB, and GUIDE

Defining machine learning algorithms as accident prediction models for Italian two-lane rural, suburban, and urban roads

Tree-based Machine Learning Ensembles and Feature Importance Approach for the Identification of Intrusions in UNR-IDD Dataset

Machine Learning Meets Microeconomics: The Case of Decision Trees and Discrete Choice

Comparison Analysis of Tree Based and Ensembled Regression Algorithms for Traffic Accident Severity Prediction

Estimating the structural diversity introduced by decision forest algorithms: A probabilistic approach