Abstract:Decision trees are popular Classification and Regression tools and, when small-sized, easy to interpret. Traditionally, a greedy approach has been used to build the trees, yielding a very fast training process; however, controlling sparsity (a proxy for interpretability) is challenging. In recent studies, optimal decision trees, where all decisions are optimized simultaneously, have shown a better learning performance, especially when oblique cuts are implemented. In this paper, we propose a continuous optimization approach to build sparse optimal classification trees, based on oblique cuts, with the aim of using fewer predictor variables in the cuts as well as along the whole tree. Both types of sparsity, namely local and global, are modeled by means of regularizations with polyhedral norms. The computational experience reported supports the usefulness of our methodology. In all our data sets, local and global sparsity can be improved without harming classification accuracy. Unlike greedy approaches, our ability to easily trade in some of our classification accuracy for a gain in global sparsity is shown.

What problem does this paper attempt to address?

The main problem that this paper attempts to solve is to improve the sparsity of classification trees to enhance their interpretability. Specifically, the authors propose a continuous optimization method to construct Optimal Randomized Classification Trees (ORCTs), and pay special attention to reducing the number of predictor variables used for splitting nodes, thereby achieving local and global sparsity. ### Problem Background Traditional decision trees (such as CART) are constructed using a greedy algorithm. Although the training speed is fast, it is difficult to control sparsity, especially in the case of large depths, resulting in poor global sparsity. In recent years, methods based on overall optimization (such as optimal decision trees) have shown better learning performance, especially when using oblique cuts. However, these methods usually sacrifice interpretability because all predictor variables may appear in each branch rule. ### Paper Objectives To meet this challenge, the authors propose a new optimization method to simultaneously achieve local and global sparsity by introducing regularization terms. Specifically, they use the following formulas: - **Local Sparsity**: Achieved through $\ell_1$-norm regularization. - **Global Sparsity**: Achieved through $\ell_\infty$-norm regularization. The optimization problem can be expressed as: \[ \min g(a, \mu, C)+\lambda_L \sum_{j = 1}^p \|a_{j\cdot}\|_1+\lambda_G \sum_{j = 1}^p \|a_{j\cdot}\|_\infty \] where: - $g(a, \mu, C)$ is the expected misclassification cost on the training samples. - $\lambda_L$ and $\lambda_G$ are the regularization parameters for local and global sparsity respectively. - $a_{j\cdot}$ is the coefficient vector of the $j$-th predictor variable on all branch nodes. ### Main Contributions 1. **Sparsity Control**: By introducing appropriate regularization terms, it is possible to improve local and global sparsity without affecting classification accuracy. 2. **Optimization Method**: A continuous optimization framework is proposed, which can be solved by standard continuous optimization solvers. 3. **Experimental Verification**: Through experiments on multiple real - world datasets, the effectiveness of the proposed method is verified, and the advantages compared to classical methods (such as CART) are demonstrated. In summary, this paper aims to improve the sparsity and interpretability of classification trees through optimization methods, so that the model is easier to understand and apply while maintaining high classification performance.

Sparsity in Optimal Randomized Classification Trees

Optimal Sparse Regression Trees

Optimal Sparse Decision Trees

Fast Sparse Decision Tree Optimization via Reference Ensembles

Statistical Advantages of Oblique Randomized Decision Trees and Forests

Convergence Rates of Oblique Regression Trees for Flexible Function Libraries

Optimal Sparse Survival Trees

Nonlinear Optimal Classification Trees

Mixed-Integer Linear Optimization for Semi-Supervised Optimal Classification Trees

Strong Optimal Classification Trees

Efficient non-greedy optimization of decision trees

Improving Decision Sparsity

Large Scale Prediction with Decision Trees

Decision Trees for Decision-Making under the Predict-then-Optimize Framework

Learning Optimal Classification Trees Robust to Distribution Shifts

Optimal Mixed Integer Linear Optimization Trained Multivariate Classification Trees

Trading Complexity for Sparsity in Random Forest Explanations

Sparse oblique decision trees: a tool to understand and manipulate neural net features

Trading accuracy for sparsity in optimization problems with sparsity constraints

Rolling Lookahead Learning for Optimal Classification Trees

Trees, Forests, Chickens, and Eggs: When and Why to Prune Trees in a Random Forest