Abstract: We developed a novel machine learning (ML) algorithm with the goal of producing transparent models (i.e., models understandable by humans) while also flexibly accounting for nonlinearity and interactions. Our method is based on ranked sparsity, and it gives users flexible control over the degree of opacity of otherwise black box machine learning methods. The main tenet of ranked sparsity is that an algorithm should be more skeptical a priori of higher-order polynomials and interactions than of main effects; hence, including these more complex terms should require a higher level of evidence. In this work, we put our new ranked sparsity algorithm (as implemented in the open source R package sparseR) to the test in a predictive model "bakeoff" (i.e., a benchmarking study of ML algorithms applied "out of the box", that is, with no special tuning). Algorithms were trained on a large set of simulated and real-world data sets from the Penn Machine Learning Benchmarks database, addressing both regression and binary classification problems. We evaluated the extent to which our human-centered algorithm can attain predictive accuracy that rivals popular black box approaches such as neural networks, random forests, and support vector machines, while also producing more interpretable models. Using out-of-bag error as a meta-outcome, we describe the properties of data sets in which human-centered approaches can perform as well as or better than black box approaches. We found that interpretable approaches predicted optimally or within 5% of the optimal method in most real-world data sets. We provide a more in-depth comparison of random forests with interpretable methods for several case studies, including exemplars in which the algorithms performed similarly and several cases in which interpretable methods underperformed. This work provides a strong rationale for including human-centered transparent algorithms such as ours in predictive modeling applications.
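
The ranked sparsity idea described in the abstract, that more complex terms should need stronger evidence to enter the model, can be illustrated with a small weighted lasso. The sketch below is written in plain Python rather than R, and it neither uses nor reproduces the sparseR API. The function names, the toy data, and the penalty weights (1 for main effects, 2 for the interaction) are assumptions chosen only for illustration; they are not the weighting scheme of the package.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used in lasso coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, weights, lam, n_iter=50):
    """Minimize (1/2n)*||y - X b||^2 + lam * sum_j weights[j] * |b_j|
    by cyclic coordinate descent (no intercept; illustrative only)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, starting from b = 0
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Covariance of column j with the partial residual (b_j removed).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam * weights[j]) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Hypothetical toy data: a full 3x3 grid with two strong main effects
# and one weak interaction. Columns of X are x1, x2, and x1*x2.
grid = [(a, b) for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]
X = [[a, b, a * b] for a, b in grid]
y = [2.0 * a - b + 0.3 * a * b for a, b in grid]

lam = 0.1
# Uniform penalty: every term is treated alike.
uniform = weighted_lasso(X, y, [1.0, 1.0, 1.0], lam)
# Ranked penalty (assumed weights): the interaction is penalized twice as heavily.
ranked = weighted_lasso(X, y, [1.0, 1.0, 2.0], lam)

print("uniform penalty:", uniform)
print("ranked penalty: ", ranked)
```

On this toy grid, the weak interaction keeps a nonzero coefficient under the uniform penalty but is dropped once it is penalized more heavily, while both main effects are retained in either fit.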

Automatic piecewise linear regression

Boosting the Partial Least Square Algorithm for Regression Modelling

Boosting the PLS Algorithm for Regressive Modelling

Piecewise linear regression and classification

SMART: A Flexible Approach to Regression using Spline-Based Multivariate Adaptive Regression Trees

New Partially Linear Regression and Machine Learning Models Applied to Agronomic Data

Fair Multivariate Adaptive Regression Splines for Ensuring Equity and Transparency

Computationally Intensive Nonlinear Regression Methods

Adaptive predictor-set linear model: an imputation-free method for linear regression prediction on datasets with missing values

A Regression Algorithm Based on AdaBoost

Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers

GAM(L)A: An econometric model for interpretable Machine Learning

Fast Linear Model Trees by PILOT

Explainable boosted linear regression for time series forecasting

A Locally Adaptive Interpretable Regression

Random Planted Forest: a directly interpretable tree ensemble

The Artificial Regression Market

Can a Transparent Machine Learning Algorithm Predict Better than Its Black Box Counterparts? A Benchmarking Study Using 110 Data Sets

Adaptive Bayesian Linear Regression for Automated Machine Learning

Comprehensive Stepwise Selection for Logistic Regression