Rule-Enhanced Penalized Regression by Column Generation using Rectangular Maximum Agreement
Jonathan Eckstein, Noam Goldberg, Ai Kagawa
2017-07-17
Abstract: We describe a procedure enhancing L1-penalized regression by adding dynamically generated rules describing multidimensional “box” sets. Our rule-adding procedure is based on the classical column generation method for high-dimensional linear programming. The pricing problem for our column generation procedure reduces to the NP-hard rectangular maximum agreement (RMA) problem of finding a box that best discriminates between two weighted datasets. We solve this problem exactly using a parallel branch-and-bound procedure. The resulting rule-enhanced regression method is computation-intensive, but has promising prediction performance.

Management Science and Information Systems, Rutgers University, Piscataway, NJ, USA; Department of Management, Bar-Ilan University, Ramat Gan, Israel; Doctoral Program in Operations Research, Rutgers University, Piscataway, NJ, USA. Correspondence to: Jonathan Eckstein <jeckstei@business.rutgers.edu>. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

1. Motivation and Overview

This paper considers the general learning problem in which we have $m$ observation vectors $X_1, \ldots, X_m \in \mathbb{R}^n$, with matching response values $y_1, \ldots, y_m \in \mathbb{R}$. Each response $y_i$ is a possibly noisy evaluation of an unknown function $f : \mathbb{R}^n \to \mathbb{R}$ at $X_i$, that is, $y_i = f(X_i) + e_i$, where $e_i \in \mathbb{R}$ represents the noise or measurement error. The goal is to estimate $f$ by some $\hat f : \mathbb{R}^n \to \mathbb{R}$ such that $\hat f(X_i)$ is a good fit for $y_i$, that is, $|\hat f(X_i) - y_i|$ tends to be small. The estimate $\hat f$ may then be used to predict the response value $y$ corresponding to a newly encountered observation $x \in \mathbb{R}^n$ through the prediction $\hat y = \hat f(x)$. A classical linear regression model is one simple example of the many possible techniques one might employ for constructing $\hat f$. The classical regression approach to this problem is to posit
a particular functional form for $\hat f(x)$ (for example, an affine function of $x$) and then use an optimization procedure to estimate the parameters in this functional form. Here, we are interested in cases in which a concise candidate functional form for $\hat f$ is not readily apparent, and we wish to estimate $\hat f$ by searching over a very high-dimensional space of parameters.

For example, Breiman (2001) proposed the method of random forests, which constructs $\hat f$ by training regression trees on multiple random subsamples of the data, and then averaging the resulting predictors. Another proposal is the RuleFit algorithm (Friedman & Popescu, 2008), which enhances L1-regularized regression by generating box-based rules to use as additional explanatory variables. Given $a, b \in \mathbb{R}^n$ with $a \le b$, the rule function $r_{(a,b)} : \mathbb{R}^n \to \{0, 1\}$ is given by

$$ r_{(a,b)}(x) = I\Big( \bigwedge_{j \in \{1,\ldots,n\}} (a_j \le x_j \le b_j) \Big), \tag{1} $$

that is, $r_{(a,b)}(x) = 1$ if $a \le x \le b$ (componentwise) and $r_{(a,b)}(x) = 0$ otherwise. RuleFit generates rules through a two-phase procedure: first, it determines a regression tree ensemble, and then it decomposes these trees into rules and determines the regression model coefficients (including those for the rules).

The approach of Dembczyński et al. (2008a) generates rules more directly (without having to rely on an initial ensemble of decision trees) within gradient boosting (Friedman, 2001) for non-regularized regression. In this scheme, a greedy procedure generates the rules within a gradient descent method that runs for a predetermined number of iterations. Aho et al. (2012) extended the RuleFit method to solve more general multi-target regression problems. For the special case of single-target regression, however, their experiments suggest that random forests and RuleFit outperform several other methods, including their own extended implementation and the algorithm of Dembczyński et al. (2008a).
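As an illustration of the rule function in equation (1), the following minimal Python sketch evaluates a box rule on a single observation and on every row of a small dataset, producing the 0/1 column that a rule would contribute as an additional explanatory variable. The function names `rule` and `rule_column` and the toy data are hypothetical and chosen only for this example; they are not taken from the paper or from any RuleFit implementation.

```python
from typing import List, Sequence


def rule(a: Sequence[float], b: Sequence[float], x: Sequence[float]) -> int:
    """Return 1 if a <= x <= b holds componentwise, else 0 (equation (1))."""
    if not (len(a) == len(b) == len(x)):
        raise ValueError("a, b, and x must have the same length")
    return int(all(lo <= xj <= hi for lo, xj, hi in zip(a, x, b)))


def rule_column(a: Sequence[float], b: Sequence[float],
                X: Sequence[Sequence[float]]) -> List[int]:
    """Evaluate the rule on each observation (row) of X, giving a 0/1 column."""
    return [rule(a, b, row) for row in X]


# Hypothetical toy data: m = 4 observations, n = 2 variables.
X = [[1.0, 5.0], [2.0, 3.0], [4.0, 4.0], [3.0, 1.0]]
a, b = [2.0, 2.0], [4.0, 5.0]   # the box [2, 4] x [2, 5]

column = rule_column(a, b, X)
print(column)  # [0, 1, 1, 0]
```

Only the second and third observations lie inside the box, so only they receive the value 1. Both bounds are treated as closed, matching the inequalities $a_j \le x_j \le b_j$ in (1).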
Compared with random forests and other popular learning approaches such as kernel-based methods and neural networks, rule-based approaches have the advantage of generally being considered more accessible and easier to interpret by domain experts. Rule-based methods also have a considerable history in classification settings, as in, for example, Weiss & Indurkhya (1993), Cohen & Singer (1999), and Dembczyński et al. (2008b).

Here, we propose an iterative optimization-based regression procedure called REPR (Rule-Enhanced Penalized Regression). Its output models resemble those of RuleFit, but our methodology draws more heavily on exact optimization techniques from the field of mathematical programming. While it is quite computationally intensive, its prediction performance appears promising.

As in RuleFit, we start with a linear regression model (in this case, with L1-penalized coefficients to promote sparsity), and enhance it by synthesizing rules of the form (1). We incrementally adjoin such rules to our (penalized) linear regression model as if they were new observation variables. Unlike RuleFit, we control the generation of new rules using the classical mathematical programming technique of column generation. Our employment of column generation roughly resembles its use in the LPBoost ensemble classification method of Demiriz et al. (2002).

Column generation involves cyclical alternation between optimization of a restricted master problem (in our case a linear or convex quadratic program) and a pricing problem that finds the most promising new variables to adjoin to the formulation. In our case, the pricing problem is equivalent to an NP-hard combinatorial problem we call Rectangular Maximum Agreement (RMA), which generalizes the Maximum Monomial Agreement (MMA) problem as formulated and solved by Eckstein & Goldberg (2012).
We solve the RMA problem by a similar branch-and-bound procedure, implemented using parallel computing techniques.

To make our notation below more concise, we let $X$ denote the matrix whose rows are $X_1^\top, \ldots, X_m^\top$, and also let $y = (y_1, \ldots, y_m) \in \mathbb{R}^m$. We may then express a problem instance by the pair $(X, y)$. We also let $x_{ij}$ denote the $(i, j)$th element of this matrix, that is, the value of variable $j$ in observation $i$.

2. A Penalized Regression Model with Rules

Let $K$ be a set of pairs $(a, b) \in \mathbb{R}^n \times \mathbb{R}^n$ with $a \le b$, constituting a catalog of all the possible rules of the form (1) that we wish to be available to our regression model. The set $K$ will typically be extremely large: restricting each $a_j$ and $b_j$ to values that appear as $x_{ij}$ for some $i$, which is sufficient to describe all possible distinct behaviors of rules of the form (1) on the dataset $X$, there are still $\prod_{j=1}^{n} \ell_j(\ell_j + 1)/2 \ge 3^n$ possible choices for $(a, b)$, where $\ell_j = \big|\bigcup_{i=1}^{m} \{x_{ij}\}\big|$ is the number of distinct values for $x_{ij}$. The predictors $\hat f$ that our method constructs are of the form f̂(x) = β0 + n ∑