Variable Selection in High-dimensional Data: Method Comparison and its Application in Tax Assessment

Wu Wuqing, Wang Chengjie, Jiang Yong, Chen Min
DOI: https://doi.org/10.14120/j.cnki.cn11-5057/f.2013.08.009
2013-01-01
Management Review
Abstract: When the number of candidate predictor variables (p) exceeds the sample size (n) in linear regression, and especially when p >> n, many classical statistical inference methods may become invalid. Theoretical and empirical research on high-dimensional data analysis techniques is therefore necessary. This article discusses three new problems encountered in high-dimensional data analysis and introduces six variable selection methods, including SIS and LASSO. In the simulation study, five evaluation criteria are used to compare the variable selection performance of the six methods. The comparison shows that the p/n ratio is related to variable selection performance: when the p/n ratio is high, SIS is the best method; as the ratio decreases, and especially once p < n, the performance of the five methods other than the square-root LASSO begins to converge. In tax assessment, industry segmentation generally improves the quality of the assessment, but segmentation causes the number of candidate predictor variables to exceed the sample size, so variable selection techniques for high-dimensional data are needed. In this paper, the SIS method is used to model the VAT input tax of 13 subdivided industries in one city. The results indicate that the SIS method has a significant variable selection effect.
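As an illustration of the screening idea behind SIS, the following is a minimal sketch, not the authors' implementation. It assumes the usual formulation of sure independence screening: standardize the predictors, rank them by absolute marginal correlation with the response, and keep the top d. The function name `sis_screen`, the default d = n / log(n), and the simulated data are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def sis_screen(X, y, d=None):
    """Return the sorted column indices of the d predictors with the
    largest absolute marginal correlation with y (hypothetical helper)."""
    n, p = X.shape
    if d is None:
        # Assumed default: d = n / log(n), a commonly used screening size.
        d = max(1, int(n / np.log(n)))
    d = min(d, p)

    # Standardize columns; constant columns get a unit scale to avoid 0/0.
    Xc = X - X.mean(axis=0)
    sd = Xc.std(axis=0)
    sd[sd == 0] = 1.0
    Xs = Xc / sd

    # Marginal utility of each predictor: |x_j' (y - mean(y))| / n.
    yc = y - y.mean()
    omega = np.abs(Xs.T @ yc) / n

    # Keep the d predictors with the largest marginal utility.
    keep = np.argsort(omega)[::-1][:d]
    return np.sort(keep)

# Simulated p >> n setting (assumed, for illustration only).
rng = np.random.default_rng(0)
n, p = 50, 500
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 3.0                      # three truly active predictors
y = X @ beta + rng.standard_normal(n)

selected = sis_screen(X, y)         # indices that survive screening
```

In a p >> n setting, the retained set has fewer columns than observations, so a second-stage method such as LASSO or ordinary least squares can then be fitted on `X[:, selected]`.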