Abstract: Decision trees are popular classification and regression tools and, when small-sized, easy to interpret. Traditionally, a greedy approach has been used to build the trees, yielding a very fast training process; however, controlling sparsity (a proxy for interpretability) is challenging. In recent studies, optimal decision trees, where all decisions are optimized simultaneously, have shown a better learning performance, especially when oblique cuts are implemented. In this paper, we propose a continuous optimization approach to build sparse optimal classification trees, based on oblique cuts, with the aim of using fewer predictor variables in the cuts as well as along the whole tree. Both types of sparsity, namely local and global, are modeled by means of regularizations with polyhedral norms. The computational experience reported supports the usefulness of our methodology. In all our data sets, local and global sparsity can be improved without harming classification accuracy. Unlike greedy approaches, our ability to easily trade in some of our classification accuracy for a gain in global sparsity is shown.
What problem does this paper attempt to address?
The paper aims to make classification trees sparser and therefore easier to interpret. Specifically, the authors propose a continuous optimization method for constructing Optimal Randomized Classification Trees (ORCTs) that reduces the number of predictor variables used in each split (local sparsity) and across the whole tree (global sparsity).
### Problem Background
Traditional decision trees (such as CART) are built with a greedy algorithm. Training is fast, but sparsity is hard to control, especially for deep trees, so global sparsity tends to be poor. More recently, optimal decision trees, in which all decisions are optimized simultaneously, have shown better learning performance, especially when oblique cuts are used. However, these methods usually sacrifice interpretability, because an oblique cut is a linear combination of predictors and every predictor variable may therefore appear in each branching rule.
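To make the contrast concrete, the following minimal sketch compares an axis-aligned cut with an oblique cut on a single observation. It is a deterministic illustration only, not the paper's randomized formulation; the function names, the coefficient vector `a`, and the threshold `mu` are hypothetical values chosen for the example.

```python
def axis_aligned_cut(x, feature, threshold):
    """CART-style rule: compare one predictor with a threshold."""
    return x[feature] <= threshold

def oblique_cut(x, a, mu):
    """Oblique rule: compare a linear combination of predictors with a threshold."""
    return sum(a_j * x_j for a_j, x_j in zip(a, x)) <= mu

# Hypothetical observation with three predictors.
x = [1.0, 2.0, 3.0]

# The axis-aligned rule involves only predictor 1.
goes_left_axis = axis_aligned_cut(x, feature=1, threshold=2.5)

# The oblique rule involves every predictor with a nonzero coefficient.
a = [0.5, -1.0, 0.25]
goes_left_oblique = oblique_cut(x, a, mu=0.0)

# Number of predictors that appear in the oblique rule.
predictors_in_rule = sum(1 for a_j in a if a_j != 0.0)
```

Here the oblique rule uses all three predictors, which is the interpretability cost that sparsity-inducing regularization is meant to reduce.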
### Paper Objectives
To meet this challenge, the authors propose an optimization method that achieves local and global sparsity simultaneously by adding regularization terms to the objective. Specifically, they use two polyhedral norms:
- **Local Sparsity**: Achieved through $\ell_1$-norm regularization.
- **Global Sparsity**: Achieved through $\ell_\infty$-norm regularization.
The optimization problem can be expressed as:
\[
\min g(a, \mu, C)+\lambda_L \sum_{j = 1}^p \|a_{j\cdot}\|_1+\lambda_G \sum_{j = 1}^p \|a_{j\cdot}\|_\infty
\]
where:
- $g(a, \mu, C)$ is the expected misclassification cost on the training samples.
- $\lambda_L$ and $\lambda_G$ are the regularization parameters for local and global sparsity respectively.
- $a_{j\cdot}$ is the vector of coefficients of the $j$-th predictor variable across all branch nodes, and $p$ is the number of predictor variables.
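As a rough illustration of how the two regularization terms behave, the sketch below evaluates them for a small coefficient matrix. It assumes the coefficients are stored as a list of rows, one row $a_{j\cdot}$ per predictor and one column per branch node; the function names, the tolerance `tol`, and the numeric values are hypothetical, and the misclassification term $g(a, \mu, C)$ is not computed here.

```python
def sparsity_penalty(A, lam_local, lam_global):
    """Return lam_local * sum_j ||a_j.||_1 + lam_global * sum_j ||a_j.||_inf.

    A is a list of rows; row j holds the coefficients of predictor j
    at every branch node (an assumed storage layout).
    """
    l1_total = sum(sum(abs(v) for v in row) for row in A)
    linf_total = sum(max(abs(v) for v in row) for row in A)
    return lam_local * l1_total + lam_global * linf_total

def sparsity_counts(A, tol=1e-8):
    """Count predictors used per branch node (local) and in the whole tree (global)."""
    n_nodes = len(A[0])
    per_node = [sum(1 for row in A if abs(row[t]) > tol) for t in range(n_nodes)]
    used_anywhere = sum(1 for row in A if any(abs(v) > tol for v in row))
    return per_node, used_anywhere

# Hypothetical coefficients: 3 predictors (rows) x 3 branch nodes (columns).
A = [
    [0.5, 0.0, -1.0],
    [0.0, 0.0, 0.0],   # predictor 1 is never used: its row is entirely zero
    [0.0, 2.0, 0.25],
]

penalty = sparsity_penalty(A, lam_local=0.1, lam_global=0.2)
per_node, used_anywhere = sparsity_counts(A)
```

The $\ell_1$ term shrinks individual coefficients, so fewer predictors appear in each cut. The $\ell_\infty$ term of a row is zero only when the whole row is zero, so it pushes a predictor out of every cut at once.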
### Main Contributions
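1. **Sparsity Control**: With appropriate regularization terms, local and global sparsity can be improved without harming classification accuracy on the data sets tested. Unlike greedy approaches, the method also makes it easy to trade some accuracy for a further gain in global sparsity.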
2. **Optimization Method**: A continuous optimization framework is proposed, which can be solved by standard continuous optimization solvers.
3. **Experimental Verification**: Experiments on multiple real-world datasets verify the effectiveness of the proposed method and demonstrate its advantages over classical methods such as CART.
In summary, the paper uses continuous optimization to build classification trees that are sparser and more interpretable, so that the model is easier to understand and apply while maintaining high classification performance.