A modified random forest approach to improve multi-class classification performance of tobacco leaf grades coupled with NIR spectroscopy

Jun Bin,Fangfang Ai,Wei Fan,Jiheng Zhou,Yong-Huan Yun,Yi-Zeng Liang
DOI: https://doi.org/10.1039/c5ra25052h
IF: 4.036
2016-01-01
RSC Advances
Abstract:To select informative variables for improving the ensemble performance in random forests (RF), a modified RF method, named random forest combined with Monte Carlo and uninformative variable elimination (MC-UVE-RF), is proposed for multi-class classification analysis of near-infrared (NIR) spectroscopy in this work. The MC method is used to increase the diversity of classification trees in RF and the UVE method is applied to gradually eliminate the less important variables based on variable reliability obtained by aggregation of each sub-model. The above two steps can be regarded as a variable selection process. As comparisons to MC-UVE-RF, the conventional RF, model population analysis combined with RF (MPA-RF) and support vector machine (SVM) for discrimination of tobacco grades by NIR spectroscopy have also been investigated. MC-UVE-RF has a marked superiority for discriminating tobacco samples into high-quality, medium-quality and low-quality groups of dataset I and II with external validation accuracy 100% and 96.83%, respectively (coarse classification). Furthermore, a good external validation accuracy in the subdivision of high-quality, medium-quality and low-quality groups of dataset I is 88.46%, 97.22% and 96%, and that of the subdivision of dataset II's three groups is 100%, 97.14% and 100%, respectively, which are better than or equal to those by other methods (refined classification). Therefore, MC-UVE-RF is a powerful alternative to multiple classification problems. Moreover, it could be a fast and powerful method for discrimination of tobacco leaf grades coupled with NIR technology instead of artificial judgment.
What problem does this paper attempt to address?