A New Strategy to Prevent Over-Fitting in Partial Least Squares Models Based on Model Population Analysis

Bai-Chuan Deng, Yong-Huan Yun, Yi-Zeng Liang, Dong-Sheng Cao, Qing-Song Xu, Lun-Zhao Yi, Xin Huang
DOI: https://doi.org/10.1016/j.aca.2015.04.045
IF: 6.911
2015-01-01
Analytica Chimica Acta
Abstract: Partial least squares (PLS) is one of the most widely used methods for chemical modeling. However, like many other methods with tunable parameters, it has a strong tendency to over-fit. A crucial step in PLS model building is therefore selecting the optimal number of latent variables (nLVs). Cross-validation (CV) is the most popular method for PLS model selection because it selects a model on the basis of prediction ability. However, CV may not yield a clear minimum of the prediction error, which makes model selection difficult. To solve this problem, we propose a new strategy for PLS model selection that combines the cross-validated coefficient of determination (Q²cv) with model stability (S). S is defined as the stability of the PLS regression vectors, which is obtained using model population analysis (MPA). The results show that, when a clear maximum of Q²cv is not obtained, S provides additional information about over-fitting and helps in finding the optimal nLVs. Compared with other regression-vector-based indicators such as the Euclidean 2-norm (B2), the Durbin-Watson statistic (DW) and the jaggedness (J), S is more sensitive to over-fitting. The model selected by our method has both good prediction ability and good stability.
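
The idea of tracking Q²cv and S together across nLVs can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the formula for S, so the sketch assumes S is the average over variables of |mean| / standard deviation of each regression coefficient across a population of sub-models fitted on random subsets of the samples. The synthetic data, the function names (pls1_coefficients, q2_cv, stability) and all parameter values are hypothetical.

```python
import numpy as np

def pls1_coefficients(X, y, n_lv):
    """Fit single-response PLS (NIPALS) and return (b, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)      # weight vector
        t = Xr @ w                     # scores
        tt = t @ t
        p = Xr.T @ t / tt              # X loadings
        c = yr @ t / tt                # y loading
        Xr = Xr - np.outer(t, p)       # deflate X
        yr = yr - c * t                # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)  # regression vector in X space
    return b, x_mean, y_mean

def q2_cv(X, y, n_lv, n_folds=5, seed=0):
    """Cross-validated coefficient of determination: 1 - PRESS / SS_total."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    press = 0.0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        b, xm, ym = pls1_coefficients(X[train], y[train], n_lv)
        pred = (X[test_idx] - xm) @ b + ym
        press += np.sum((y[test_idx] - pred) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def stability(X, y, n_lv, n_models=100, ratio=0.8, seed=0):
    """ASSUMED definition of S: mean over variables of |mean(b_j)| / std(b_j),
    computed over sub-models built on random sample subsets (MPA-style)."""
    rng = np.random.default_rng(seed)
    n, k = len(y), int(ratio * len(y))
    vectors = []
    for _ in range(n_models):
        idx = rng.choice(n, size=k, replace=False)
        vectors.append(pls1_coefficients(X[idx], y[idx], n_lv)[0])
    B = np.array(vectors)
    return float(np.mean(np.abs(B.mean(axis=0)) / B.std(axis=0)))

# Hypothetical synthetic data: three latent factors of unequal strength plus noise.
rng = np.random.default_rng(42)
T = rng.normal(size=(60, 3))
loadings = rng.normal(size=(3, 30)) * np.array([[5.0], [1.0], [0.2]])
X = T @ loadings + 0.1 * rng.normal(size=(60, 30))
y = T @ np.array([1.0, 1.0, 1.0]) + 0.1 * rng.normal(size=60)

results = []
for n_lv in range(1, 9):
    results.append((n_lv, q2_cv(X, y, n_lv), stability(X, y, n_lv)))

print("nLVs   Q2cv      S")
for n_lv, q2, s in results:
    print(f"{n_lv:4d}  {q2:6.3f}  {s:7.3f}")
```

Reading the two columns side by side shows the kind of evidence the abstract describes: where Q²cv is flat across several candidate nLVs, a drop in S would suggest that the additional latent variables make the regression vector less reproducible across sub-models.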