Abstract: Variable importance measures are the main tools used to analyse the black-box mechanisms of random forests. Although the mean decrease accuracy (MDA) is widely accepted as the most efficient variable importance measure for random forests, little is known about its statistical properties. In fact, the definition of mean decrease accuracy varies across the main random forest software. In this article, our objective is to rigorously analyse the behaviour of the main mean decrease accuracy implementations. Consequently, we mathematically formalize the various implemented MDA algorithms, and then establish their limits when the sample size increases. This asymptotic analysis reveals that these mean decrease accuracy versions differ as importance measures, since they converge towards different quantities. More importantly, we break down these limits into three components: the first two terms are related to Sobol indices, which are well-defined measures of a covariate's contribution to the response variance, widely used in the sensitivity analysis field, as opposed to the third term, whose value increases with dependence among covariates. Thus, we theoretically demonstrate that the mean decrease accuracy does not target the right quantity to detect influential covariates in a dependent setting, a fact that has already been noticed experimentally. To address this issue, we define a new importance measure for random forests, the Sobol-mean decrease accuracy (Sobol-MDA), which fixes the flaws of the original mean decrease accuracy, and consistently estimates the accuracy decrease of the forest retrained without a given covariate, but with an efficient computational cost. The Sobol-mean decrease accuracy empirically outperforms its competitors on both simulated and real data for variable selection. An open source implementation in R and C++ is available online.
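
The gap between a permutation-style accuracy decrease and the accuracy decrease obtained by retraining without a covariate can be illustrated with a small sketch. This is not the Sobol-MDA algorithm and not any package's MDA implementation: as a simplifying assumption, a least-squares linear model stands in for the random forest, and the data-generating process, the helper names (`fit_linear`, `permutation_decrease`, `retrain_decrease`) and the correlation `rho = 0.9` are all hypothetical choices made for illustration only.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with intercept (a stand-in for a forest); returns a predictor."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    def predict(X_new):
        return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

    return predict


def mse(y, pred):
    """Mean squared prediction error."""
    return float(np.mean((y - pred) ** 2))


def permutation_decrease(predict, X, y, j, rng):
    """Increase in test MSE after randomly permuting column j of the test covariates."""
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    return mse(y, predict(X_perm)) - mse(y, predict(X))


def retrain_decrease(X_train, y_train, X_test, y_test, j):
    """Increase in test MSE when the model is refit without column j."""
    full = fit_linear(X_train, y_train)
    keep = [k for k in range(X_train.shape[1]) if k != j]
    reduced = fit_linear(X_train[:, keep], y_train)
    return mse(y_test, reduced(X_test[:, keep])) - mse(y_test, full(X_test))


# Hypothetical setting: two strongly dependent Gaussian covariates, additive response.
rng = np.random.default_rng(0)
n, rho = 4000, 0.9
cov = np.array([[1.0, rho], [rho, 1.0]])
X = rng.multivariate_normal(np.zeros(2), cov, size=2 * n)
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=2 * n)
X_train, X_test, y_train, y_test = X[:n], X[n:], y[:n], y[n:]

model = fit_linear(X_train, y_train)
perm = permutation_decrease(model, X_test, y_test, 0, rng)
retrain = retrain_decrease(X_train, y_train, X_test, y_test, 0)
print(f"permutation-based decrease: {perm:.3f}")
print(f"retraining-based decrease:  {retrain:.3f}")
```

Under these assumptions, the permutation-based value should come out well above the retraining-based one, because permuting a covariate also breaks its dependence with the other covariate, whereas the refitted model can partly recover the removed covariate from the remaining one. The retraining-based quantity is the one the abstract describes the Sobol-MDA as estimating, without the cost of refitting.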

Unbiased variable importance for random forests

Enhancing Variable Importance in Random Forests: A Novel Application of Global Sensitivity Analysis

MDA for random forests: inconsistency, and a practical solution via the Sobol-MDA

Variable ranking and selection with random forest for unbalanced data

Random Forests: some methodological insights

From unbiased MDI Feature Importance to Explainable AI for Trees

Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival

Random Forest Variable Importance-based Selection Algorithm in Class Imbalance Problem

Randomization Can Reduce Both Bias and Variance: A Case Study in Random Forests

Models under which random forests perform badly; consequences for applications

From global to local MDI variable importances for random forests and when they are Shapley values

MMD-based Variable Importance for Distributional Random Forest

Inference of genetic networks using random forests: performance improvement using a new variable importance measure

Opening the random forest black box by the analysis of the mutual impact of features

The Importance of Variable Importance

Unbiased Gradient Boosting Decision Tree with Unbiased Feature Importance

Understanding Random Forests: From Theory to Practice

Process Variable Importance Analysis by Use of Random Forests in a Shapley Regression Framework

Variable importance in binary regression trees and forests

A Debiased MDI Feature Importance Measure for Random Forests

Dimension Reduction Forests: Local Variable Importance using Structured Random Forests