Fast stepwise regression based on multidimensional indexes

Barbara Żogała-Siudem, Szymon Jaroszewicz
DOI: https://doi.org/10.1016/j.ins.2020.11.031
IF: 8.1
2021-03-01
Information Sciences
Abstract: We present an approach to efficiently construct stepwise regression models in a very high-dimensional setting using a multidimensional index. The approach is based on an observation that the collections of available predictor variables often remain relatively stable and many models are built based on the same predictors. Example scenarios include data warehouses against which multiple ad-hoc analytical models are built or collections of publicly available open data which remain relatively fixed and are used as a source of predictor variables for many models. We propose an approach where the user simply provides a target variable and the algorithm uses a pre-built multidimensional index to automatically select predictors from millions of available variables, yielding results identical to standard stepwise regression, but an order of magnitude faster. The algorithm has been tested on the large statistical database available from Eurostat, and has been demonstrated to produce interpretable and accurate models. We demonstrate experimentally that our approach produces results that are significantly better than other approaches to modeling with ultra-high-dimensional data. Finally, we discuss potential pitfalls such as the presence of highly correlated variables, and show how they can be overcome.
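For orientation, the baseline that the abstract refers to as "standard stepwise regression" can be sketched as plain forward selection with a linear scan over all candidate predictors. This is a minimal illustration and not the authors' implementation: the function name `forward_stepwise`, the fixed number of steps, and the Gram-Schmidt update are assumptions made for this sketch. The scan over all columns in each step is the part that, according to the abstract, a pre-built multidimensional index is meant to speed up; how the index is built and queried is not described here and is not shown.

```python
import numpy as np


def forward_stepwise(X, y, n_steps):
    """Greedy forward selection (hypothetical helper, linear scan).

    Each step adds the column whose inclusion reduces the residual
    sum of squares the most, given the columns already selected.
    """
    # Center the data so that an intercept is handled implicitly.
    X_orth = X - X.mean(axis=0)
    residual = y - y.mean()
    p = X_orth.shape[1]
    selected = []

    for _ in range(n_steps):
        norms = np.linalg.norm(X_orth, axis=0)
        usable = norms > 1e-10  # skip columns already explained by the model

        # RSS reduction for column j is (x_j . r)^2 / ||x_j||^2, where x_j
        # has been orthogonalized against the selected columns.
        scores = np.zeros(p)
        scores[usable] = np.abs(X_orth[:, usable].T @ residual) / norms[usable]
        if selected:
            scores[selected] = 0.0

        j = int(np.argmax(scores))  # linear scan over all candidates
        selected.append(j)

        # Gram-Schmidt step: remove the chosen direction from the residual
        # and from every remaining candidate column.
        q = X_orth[:, j] / norms[j]
        residual = residual - q * (q @ residual)
        X_orth = X_orth - np.outer(q, q @ X_orth)

    return selected
```

Calling `forward_stepwise(X, y, 2)` on a matrix `X` of shape (observations, predictors) returns the indices of the two columns chosen, in selection order. Each step costs time proportional to the number of candidate columns, which is what becomes expensive with millions of predictors.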
computer science, information systems